php - Illegal non-standard quotes in XML


I'm allowing some user input on my website that is later read as XML. Every once in a while I get these weird single or double quotes, like this: ”’. These are copied directly from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my XML. htmlentities did not seem to touch them.

Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.

EDIT: I forgot to clarify that these quotes are not being used in attributes, but in the following way:

<SomeTag>User’s Input</SomeTag>

Answer

Solution:

Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following XML declaration, which carries the encoding pragma, at the top of your XML files (provided the files really are saved as UTF-8):

<?xml version="1.0" encoding="UTF-8"?>

There may also be a UTF-8 option in the parser's API.
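
As a rough illustration of the point above, here is a minimal sketch using PHP's bundled DOM extension. The question doesn't say which parser is in use, so DOMDocument is purely an assumption, and the tag name is borrowed from the question. It writes a document with the UTF-8 declaration and reads the curly apostrophe back unchanged:

```php
<?php
// Assumption: PHP 7+ with the DOM extension; the asker's real parser is unknown.
$doc = new DOMDocument('1.0', 'UTF-8');

// createElement() does not escape '&' in its value argument,
// so the user text is attached as a separate text node instead.
$tag = $doc->createElement('SomeTag');
$tag->appendChild($doc->createTextNode("User\u{2019}s Input"));
$doc->appendChild($tag);

// The serialized string starts with the XML declaration naming UTF-8.
$xml = $doc->saveXML();

// Parse it back and pull the text out again.
$parsed = new DOMDocument();
$parsed->loadXML($xml);
$text = $parsed->documentElement->textContent;

echo $xml;
```

This only works if the string handed to the DOM really is UTF-8; bytes in some other encoding would need converting first.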

Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!

Edit 2: Apparently, those quotes aren't even legal in UTF-8, so ignore what I said above. Instead, you might find what you're looking for here, where a similar problem is being discussed.


Answer

Solution:

Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft Word and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".

If you need to get rid of them, a simple global replace using a text editor will do the job fine.
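
If you'd rather do that replace in PHP than in a text editor, something along these lines should work. It's only a sketch: the function name straightenQuotes is made up, it assumes the text is already UTF-8, and it only covers the four most common curly quotes:

```php
<?php
// Hypothetical helper: maps typographical quotes to typewriter quotes.
// Assumes $text is UTF-8 and PHP 7+ (for the "\u{...}" escapes).
function straightenQuotes(string $text): string
{
    $map = [
        "\u{2018}" => "'",  // left single quotation mark
        "\u{2019}" => "'",  // right single quotation mark / apostrophe
        "\u{201C}" => '"',  // left double quotation mark
        "\u{201D}" => '"',  // right double quotation mark
    ];

    // strtr() with an array replaces every key with its value in one pass.
    return strtr($text, $map);
}

echo straightenQuotes("User\u{2019}s \u{201C}Input\u{201D}"), "\n";
// prints: User's "Input"
```

Keep in mind that this changes what the user typed, so it is a workaround rather than a fix for the underlying encoding problem.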

But you might try to work out first why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply some unwanted transcoding is happening somewhere along the line).
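
One way to start that investigation is to look at the raw bytes of an offending string. The sketch below is an assumption about how you might do it, and isValidUtf8 is a made-up helper name. A right single quote stored as UTF-8 is the three bytes e2 80 99, while the same character in Windows-1252 is the single byte 92, which is not valid UTF-8 on its own:

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
// An empty pattern with the /u modifier matches only when the subject is valid UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$asUtf8 = "\u{2019}"; // right single quote, UTF-8 encoded
$asCp1252 = "\x92";   // the same quote as a Windows-1252 byte

echo bin2hex($asUtf8), ' valid UTF-8: ', var_export(isValidUtf8($asUtf8), true), "\n";
echo bin2hex($asCp1252), ' valid UTF-8: ', var_export(isValidUtf8($asCp1252), true), "\n";
```

If the hex dump shows single bytes in the 80 to 9f range where the quotes should be, the input is probably arriving in a Windows code page rather than UTF-8.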


Answer

Solution:

If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:

$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;

For me, this gives:

&rdquo;&rsquo;

whereas

// No encoding argument: before PHP 5.4 the default was ISO-8859-1, not UTF-8.
$html = htmlentities( '”’' );
echo $html;

gets confused:

&acirc;??&acirc;??

If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
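
One caveat for the XML case: XML itself predefines only five named entities (amp, lt, gt, quot, apos), so a name like &rdquo; will not resolve in an XML parser unless a DTD declares it. A possible alternative, sketched below, is to normalise the input to UTF-8 and escape only the XML-special characters. The function name toXmlText and the guess that non-UTF-8 input is Windows-1252 are both assumptions, not something the question confirms:

```php
<?php
// Hypothetical helper: returns UTF-8 text that is safe to place inside an XML element.
// Assumption: anything that is not valid UTF-8 is Windows-1252 (typical for pasted Word text).
function toXmlText(string $input, string $sourceEncoding = 'Windows-1252'): string
{
    // Leave valid UTF-8 untouched; otherwise transcode from the assumed source encoding.
    $utf8 = preg_match('//u', $input) === 1
        ? $input
        : iconv($sourceEncoding, 'UTF-8', $input);

    // ENT_XML1 limits escaping to the entities XML 1.0 defines; curly quotes stay as-is.
    return htmlspecialchars($utf8, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo '<SomeTag>', toXmlText("User\x92s <Input>"), '</SomeTag>', "\n";
// prints: <SomeTag>User’s &lt;Input&gt;</SomeTag>
```

The encoding guess is the weak spot here; if you know what encoding your form actually submits, pass that in instead of relying on the default.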


Answer

Solution:

Stay away from Microsoft Office apps. Word, Excel etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with non-standard "smart quotes".

These quote characters are truly non-standard and never made it into the official Latin-1 character set. All the MS Office apps "helpfully" replace standard quote characters with these abominations.

Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips, and regexes to get rid of these.

