PHP String Type Encoding

base64_encode(), base64_decode(), chunk_split(), convert_uuencode() and RFC 2045 section 6.8 are topics I may want to investigate if I use strings to hold binary data (rather than character data). They are about representing binary data using only printable ASCII characters, so that it passes intact through a network or protocol interface. For example, let's say you want to place the binary data of a picture in the body of an email. The safest way is to encode it into some form of text, then decode it on the end where the email message is received. See this post.
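As a minimal sketch of that round trip (the three sample bytes are made up for illustration and stand in for real picture data):

```php
<?php
// Hypothetical binary payload: three raw bytes standing in for image data.
$binary = "\x00\xFF\x10";

// Encode to printable ASCII, then wrap at 76 characters per line with CRLF,
// the line length used for base64 bodies in RFC 2045.
$encoded = base64_encode($binary);            // "AP8Q"
$wrapped = chunk_split($encoded, 76, "\r\n"); // "AP8Q\r\n"

// On the receiving end, decode back to the original bytes.
// base64_decode() skips the line breaks added by chunk_split().
$decoded = base64_decode($wrapped);

var_dump($decoded === $binary); // bool(true)
```

The encoded form is about a third larger than the raw bytes, which is the price of staying inside the printable range.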

For the rest of this post we will assume we are talking about character strings.

Generally speaking, PHP code treats strings as ASCII or UTF-8. However, many browsers, text editors and some APIs default to ISO-8859-1 (aka Latin-1). Either way, a PHP string is just a sequence of bytes; the language itself attaches no encoding to it.

Note: A one-byte UTF-8 character is identical to its ASCII encoding.

One thing to take into consideration when using UTF-8 with PHP is that some PHP functions (strlen() and substr(), for example) work on bytes and do not take into account that some characters use multiple bytes; therefore, these functions will not work as expected on such text.
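A small sketch of the problem, assuming the mbstring extension is available (it is not mentioned in the text above, and the sample word is just an illustration):

```php
<?php
// "café" encoded as UTF-8: the "é" takes two bytes (0xC3 0xA9).
$word = "caf\xC3\xA9";

$byteCount = strlen($word);             // counts bytes: 5
$charCount = mb_strlen($word, 'UTF-8'); // counts characters: 4

// substr() cuts on byte boundaries and can split a character in half.
$cutBytes = substr($word, 0, 4);             // "caf\xC3", no longer valid UTF-8
$cutChars = mb_substr($word, 0, 4, 'UTF-8'); // the whole word, intact

var_dump(mb_check_encoding($cutBytes, 'UTF-8')); // bool(false)
var_dump($cutChars === $word);                   // bool(true)
```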

PHP 6 was planned to use IBM’s ICU libraries to provide native support for character sets, which would have brought PHP on a par with Java in this area. That release was abandoned, however; ICU-backed functionality is available through the intl extension instead.

In my post about encoding I discussed two related problems: UTF-8 strings being output to ISO-8859-1 (Latin-1) browsers, and form data coming in from an ISO-8859-1 (Latin-1) browser.

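ISO-8859-1 is a single-byte, 8-bit encoding scheme. UTF-8 uses a single byte only for the 7-bit ASCII range (0 to 127); anything above 127 is encoded as two or more bytes. So, in the range above 127, the ISO-8859-1 characters are encoded differently from their UTF-8 counterparts.

For those characters ISO-8859-1 is incompatible with UTF-8 even though each one is only a single byte long in ISO-8859-1: a lone byte above 127 is not valid UTF-8. In the ASCII range the two encodings agree.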
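To make the difference concrete, here is a sketch using "é" as the sample character, again assuming the mbstring extension is available:

```php
<?php
$latin1 = "\xE9";     // "é" in ISO-8859-1: one byte
$utf8   = "\xC3\xA9"; // "é" in UTF-8: two bytes

// The lone Latin-1 byte is not valid UTF-8.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Converting explicitly yields the two-byte UTF-8 form.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($converted === $utf8); // bool(true)

// In the ASCII range the two encodings agree byte for byte.
$ascii = mb_convert_encoding("A", 'UTF-8', 'ISO-8859-1');
var_dump($ascii === "A"); // bool(true)
```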

Both the PHP 4 and PHP 5 XML DOM extensions use UTF-8 as their internal encoding. This means that they mostly get it right; however, there is one major gotcha: they expect input strings to be UTF-8 encoded. If you use ISO-8859-1 as your internal encoding (which you most likely do), each and every string that you pass to the DOM API should first be encoded with utf8_encode(). It’s important to realize that you have to do this regardless of which encoding the document is output in.
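A sketch of that rule with the PHP 5+ DOM extension. The element name and sample string are hypothetical, and mb_convert_encoding() is swapped in for utf8_encode() here because the latter is deprecated as of PHP 8.2; both perform the same Latin-1 to UTF-8 conversion:

```php
<?php
// Application data held as ISO-8859-1: "café" with a one-byte "é".
$latin1 = "caf\xE9";

// Convert to UTF-8 before handing the string to the DOM API.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// The document is declared as ISO-8859-1 for output, yet the DOM API
// still expects UTF-8 on the way in.
$doc = new DOMDocument('1.0', 'ISO-8859-1');
$element = $doc->createElement('name'); // hypothetical element name
$element->appendChild($doc->createTextNode($utf8));
$doc->appendChild($element);

// Reading the value back through the API returns UTF-8 again.
var_dump($element->textContent === $utf8); // bool(true)

$xml = $doc->saveXML(); // serialized with the declared encoding
```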

utf8_decode() converts UTF-8 to ISO-8859-1 aka Latin-1. Use it when you are taking strings from a database or API which uses UTF-8 and passing them to a browser or API which uses ISO-8859-1 strings. Caution: UTF-8 can encode many characters that are not available in ISO-8859-1; utf8_decode() replaces those with a question mark. Note that utf8_encode() and utf8_decode() are both deprecated as of PHP 8.2.

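iconv() is much better than utf8_decode() for many reasons. For example, it can convert the euro sign, which ISO-8859-1 lacks, by targeting an encoding that does have it, such as ISO-8859-15. More generally, if you need to convert text from any encoding to any other encoding, look at iconv() instead.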
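A sketch of the euro case, assuming the iconv and mbstring extensions are available. ISO-8859-15 is my choice of target encoding for illustration, and mb_convert_encoding() stands in for the deprecated utf8_decode() to show the lossy conversion to ISO-8859-1:

```php
<?php
$euro = "\xE2\x82\xAC"; // the euro sign in UTF-8 (U+20AC)

// ISO-8859-1 has no euro sign; mbstring substitutes "?" by default.
$toLatin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// ISO-8859-15 does have it, at byte 0xA4.
$toLatin9 = iconv('UTF-8', 'ISO-8859-15', $euro);

var_dump($toLatin1 === "?");    // bool(true)
var_dump($toLatin9 === "\xA4"); // bool(true)
```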

Read post: character encoding for database to learn about practices.

See: valid characters for my strings


About samehramzylabib
