php - Illegal non-standard quotes in XML
I'm allowing some user input on my website, that later is read in XML. Every once in a while I get these weird single or double quotes like this
”’. These are directly copied from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my xml. htmlentities did not seem to touch them.
Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.
EDIT- I forgot to clarify these quotes are not being used in attributes, but in the following way:
Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding pragma at the top of your XML files:
There may also be a UTF-8 option in the parser's API.
Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!
Edit 2: Apparently, those quotes aren't even legal in UTF-8, so ignore what I said above. Instead, you might find what you're looking for here, where a similar problem is being discussed.
Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".
If you need to get rid of them, a simple global replace using a text editor will do the job fine.
But you might try to work out first why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply some unwanted transcoding is happing somewhere along the line).
If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:
For me gives:
If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
Stay away from MicroSoft Office apps. Word, Excel etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with non-standard "smart-quotes".
These quote characters are truly non-standard and never made it into the official latin-1 character set. All the MS Office apps "helpfully" replace standard quote characters with these abominations.
Just google for "undoing smatquotes" or "convert smartquotes back" for hints tips and regexes to get rid of these.