php - Replacing non UTF8 characters

273

In PHP, I need to replace all non-UTF-8 characters in a string. However, not with some equivalent (as the iconv function does with //TRANSLIT) but with a chosen character (like "_" or "*", for example).

Typically, I want the user to be able to see the positions where the invalid characters were found.

I didn't find any function that does this, so I was going to:

  • use iconv with //IGNORE
  • do a diff on the two strings and insert the wanted character where the non-UTF-8 ones were

Do you see a better way to do this? Are there functions in PHP that can be combined to get this behavior?

Thanks for your help.

96

Answer

Solution:

Here are two preg_replace calls that achieve something close to what you want:

// Reject control bytes, overlong 2-byte sequences, malformed 2- and 3-byte sequences,
// and every 4-byte sequence (U+10000 and above), replacing each match with '?'.
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

// Reject overlong 3-byte sequences and UTF-16 surrogates, replacing each match with '?'.
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );

Note that you can change the replacement (currently '?') to anything else by changing the second argument of each call: preg_replace('blablabla', '?', $some_string).
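As a rough sketch of that note, the two calls could be wrapped in a helper that takes the replacement as a parameter. The function name replace_invalid_utf8 and its signature are assumptions made for illustration and do not come from the original article; the two patterns are copied unchanged from the snippet above. preg_replace_callback is used here instead of preg_replace so that a replacement such as '$1' is inserted literally rather than being read as a backreference.

```php
<?php
// Hypothetical helper: the name and signature are illustrative assumptions.
function replace_invalid_utf8(string $input, string $replacement = '?'): string
{
    // Pass 1: control bytes, stray continuation bytes, invalid lead bytes,
    // every 4-byte sequence, and 2- or 3-byte sequences of the wrong length.
    $pass1 = '/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'
        . '|[\x00-\x7F][\x80-\xBF]+'
        . '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'
        . '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'
        . '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S';

    // Pass 2: overlong 3-byte sequences and UTF-16 surrogates.
    $pass2 = '/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S';

    // Returning the replacement from a callback keeps it literal.
    $literal = function (array $match) use ($replacement) {
        return $replacement;
    };

    $output = $input;
    foreach ([$pass1, $pass2] as $pattern) {
        $result = preg_replace_callback($pattern, $literal, $output);
        if ($result === null) {
            // preg_replace_callback returns null when the regex engine fails.
            throw new RuntimeException('Regex failed, PCRE error code ' . preg_last_error());
        }
        $output = $result;
    }

    return $output;
}

echo replace_invalid_utf8("a\xFFb", '_'), PHP_EOL; // prints a_b
```

Because the patterns are unchanged, the helper keeps their quirks: valid 4-byte characters (emoji, for instance) are replaced as well, and an ASCII character directly followed by a stray continuation byte is swallowed into the same match.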

The original article: http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/
