character encoding

Note to self: On gxsam11.net I have a page called “Character Encoding”. It is at > gxsam11.net > How To > PHP > input, output, file system > character encoding.

Characters, Encoding and PHP

Understanding encoding was a simple thing when I was a Computer Science student studying C++ in the nineties. There were strings and then there were characters. While handling a string you always knew (based on the encoding standard or the variable type involved) whether its characters were one, two, three or four bytes wide. Therefore, while processing a string you knew how many bytes to read before you could say you had read one character. At least that is how I recall things were in those days. Today I’m learning that the Unicode UTF-8 encoding is not like that. It is a variable-width encoding: some characters take one byte, some two, some three and some four. (The original UTF-8 design allowed sequences of up to six bytes, but the current standard, RFC 3629, limits them to four.)

So, the question I intend to answer on this page is “what are the implications of this for a PHP programmer?” One thing is for sure:

You need to know which encoding a string variable uses to be able to consistently process it correctly.

One thing to take into consideration when using UTF-8 with PHP is that the classic string functions, such as strlen() and substr(), work on bytes rather than characters. They do not take into account that some characters use multiple bytes; therefore, these functions will not work as expected on multibyte text.
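
Here is a minimal sketch of that byte-versus-character mismatch. It assumes the mbstring extension is loaded and PHP 7.0 or later (for the \u{...} escape); the sample word is my own choice.

```php
<?php
// "é" is one character but two bytes in UTF-8.
$word = "caf\u{00E9}";   // "café"

$byteCount = strlen($word);              // counts bytes
$charCount = mb_strlen($word, 'UTF-8');  // counts characters (needs mbstring)

echo $byteCount, "\n";  // 5
echo $charCount, "\n";  // 4

// substr() counts bytes, so it can cut a multibyte character in half.
$brokenHead = substr($word, 0, 4);              // "caf" plus a stray 0xC3 byte
$cleanHead  = mb_substr($word, 0, 4, 'UTF-8');  // "café", all four characters

echo bin2hex($brokenHead), "\n";  // 636166c3
echo bin2hex($cleanHead), "\n";   // 636166c3a9
```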

One way around this problem is to allow only plain ASCII (keyboard) characters in strings; all other characters are then represented by their HTML entities.
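
One possible way to do that is sketched below, assuming the mbstring extension is available. The helper name to_ascii_with_entities() is hypothetical, and numeric entities are my choice here because they exist for every character, whereas named entities do not.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with a numeric
// HTML entity so that the resulting string contains ASCII bytes only.
function to_ascii_with_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask]
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$original = "caf\u{00E9}";                 // "café"
$safe     = to_ascii_with_entities($original);
$restored = html_entity_decode($safe, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $safe, "\n";      // caf&#233;
echo $restored, "\n";  // café
```

Once a string is pure ASCII, byte-oriented functions such as strlen() count bytes and characters identically, although the length is now that of the entity text, not of the original word.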

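PHP 6 was planned to use IBM’s ICU libraries to provide native support for character sets, which would have brought PHP on a par with Java in this area. That release was abandoned, however. Today the intl extension (which is built on ICU) and the mbstring extension cover much of the same ground.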

ISO-8859-1 aka Latin-1

Older Internet browsers, older versions of the HTML standard, and many text editors default to ISO-8859-1, aka Latin-1. This can cause problems if your PHP script assumes all strings are UTF-8 encoded.

Display problems:

Let’s say you have a UTF-8 encoded web page that contains characters at positions 0-31 and 127-159. Let’s also say the user’s browser recognizes that the page is UTF-8 and converts it to its native ISO-8859-1 encoding. The following may happen: the characters at positions 0-31 and 127-159 are replaced by a question-mark-like substitute character. This is because ISO-8859-1 explicitly does not define displayable characters for positions 0-31 and 127-159. The only positions in these ranges that are commonly used are 9, 10 and 13, which are tab, line feed (newline) and carriage return respectively.
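
A quick way to spot such positions in an ISO-8859-1 string is sketched below. The helper name has_undisplayable_latin1() is hypothetical; the character class simply lists the ranges named above, minus positions 9, 10 and 13.

```php
<?php
// Hypothetical helper: true if an ISO-8859-1 string contains a byte in the
// ranges 0-31 or 127-159, other than tab (9), line feed (10) and
// carriage return (13).
function has_undisplayable_latin1(string $latin1): bool
{
    // No /u modifier: the pattern is matched byte by byte.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', $latin1) === 1;
}

var_dump(has_undisplayable_latin1("line one\r\n\tline two"));  // bool(false)
var_dump(has_undisplayable_latin1("caf\xE9"));                 // bool(false)
var_dump(has_undisplayable_latin1("bad \x85 byte"));           // bool(true)
```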

Processing problems:

When character data comes from a filled-out web form or from some text editors, it may have been encoded in ISO-8859-1. If that string contains characters whose encoding is peculiar to ISO-8859-1 and your script expects UTF-8, problems will follow. ISO-8859-1 is an 8-bit encoding scheme: every character occupies exactly one byte. UTF-8 encodes only the first 128 code points (0-127) in a single byte, using the low 7 bits of that byte. So, in the range above 127, the ISO-8859-1 characters are encoded differently from their UTF-8 counterparts: UTF-8 starts to use a second byte while ISO-8859-1 continues filling up the single byte.
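
The difference is easy to see with the letter é (code point 233). This is a small sketch that assumes the mbstring extension; the variable names are mine.

```php
<?php
$latin1 = "\xE9";      // "é" in ISO-8859-1: one byte
$utf8   = "\xC3\xA9";  // "é" in UTF-8: two bytes

// A lone 0xE9 byte is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));    // bool(true)

// Convert explicitly once you know the source encoding.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($converted), "\n";  // c3a9

// Below 128 the two encodings agree byte for byte.
$ascii = mb_convert_encoding('F', 'UTF-8', 'ISO-8859-1');
echo bin2hex($ascii), "\n";      // 46
```

Note that mb_check_encoding() can only tell you a string is not valid UTF-8. It cannot prove that the string is ISO-8859-1, since every byte sequence is technically valid in that encoding.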

Both the PHP 4 and PHP 5 XML DOM extensions use UTF-8 as their internal encoding. This means that they mostly get it right; however, there is one major GOTCHA: they expect input strings to be UTF-8 encoded. If you use ISO-8859-1 as your internal encoding (which you most likely do), each and every string that you pass to the DOM API should first be encoded with utf8_encode(). (Note that utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding() can do the same job.) It’s important to realize that you have to do this regardless of which encoding the document is output in. Annoying to say the least, but at least it’s consistent.
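
A small sketch of that rule follows, assuming the dom and mbstring extensions are loaded. The element name and the sample value are made up for illustration, and I use mb_convert_encoding() in place of utf8_encode().

```php
<?php
$latin1Name = "Andr\xE9";  // "André" as ISO-8859-1 bytes (illustrative value)

// Convert to UTF-8 before handing the string to the DOM API.
$utf8Name = mb_convert_encoding($latin1Name, 'UTF-8', 'ISO-8859-1');

// The second argument only sets the encoding the document is written out in.
$doc = new DOMDocument('1.0', 'ISO-8859-1');
$doc->appendChild($doc->createElement('name', $utf8Name));

// Strings read back from the DOM are UTF-8, whatever the output encoding.
$readBack = $doc->documentElement->textContent;
echo bin2hex($readBack), "\n";  // 416e6472c3a9

echo $doc->saveXML();
```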

What are code points?

The concept of code points is common to all Unicode encodings. I will explain it before we get into the specific UTF-8 encoding. Please note that I’m assuming you have a firm understanding of what a character is. For example, you should know that F and f are two distinct characters. You should also know that a character can be any symbol used in writing.

A code point is a type of label associated with a character. This label will be the same for a particular character no matter which Unicode encoding you are dealing with.

U+0046 is an example. It’s the code point for F.

The U+ tells you it’s Unicode, and the 0046 is a hexadecimal number which has been assigned to the character F. Note that there are other characters which look like F but have different code points. You can find all the code points at http://www.unicode.org/.
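
If you want to look up a code point from PHP, something like the following should work. It assumes the mbstring extension and PHP 7.2 or later (for mb_ord() and mb_chr()); the fullwidth F is my own example of a look-alike character.

```php
<?php
// From character to code point label.
$codePoint = mb_ord('F', 'UTF-8');
$label = sprintf('U+%04X', $codePoint);
echo $label, "\n";  // U+0046

// From code point back to character.
$char = mb_chr(0x0046, 'UTF-8');
echo $char, "\n";   // F

// A character that looks like F but has a different code point:
// FULLWIDTH LATIN CAPITAL LETTER F.
$fullwidthF = "\u{FF26}";
$fullwidthLabel = sprintf('U+%04X', mb_ord($fullwidthF, 'UTF-8'));
echo $fullwidthLabel, "\n";  // U+FF26
```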

What is a UTF-8 character made of?

First it is important to understand what UTF-8 is. UTF-8 is a Unicode scheme for storing code points in memory (a.k.a. an encoding).

The paragraph which follows talks about a model of computer memory. This model is also applicable to values stored in PHP string variables.

What is memory? Computer memory is a series of on/off electronic switches. Each switch is a bit, and every eight bits make a byte. The bytes are all stacked sequentially next to each other; we will think of them as being stacked from left to right. We will say the first byte has an address of zero, the second byte has an address of one, and so on. Each byte stores a number, which can have a decimal value between 0 and 255. A byte can be thought of as a sequence of bits organized from left to right, where the leftmost bit is the most significant bit and the rightmost bit is the least significant bit. In other words, the bits are laid out as you would write a binary number on a piece of paper.
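
This model can be poked at directly from PHP, because a string offset such as $s[0] gives you the byte at that address. The sketch below uses a hypothetical helper, dump_bytes(), and the string 'PHP' purely as sample data.

```php
<?php
// Hypothetical helper: describe each byte of a string as
// "address: decimal value, then the eight bits".
function dump_bytes(string $memory): array
{
    $rows = [];
    for ($address = 0; $address < strlen($memory); $address++) {
        $value = ord($memory[$address]);  // the number stored in this byte, 0-255
        // %08b prints the bits with the most significant bit on the left.
        $rows[] = sprintf('%d: %3d %08b', $address, $value, $value);
    }
    return $rows;
}

echo implode("\n", dump_bytes('PHP')), "\n";
// 0:  80 01010000
// 1:  72 01001000
// 2:  80 01010000
```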

If a string consisting of the word Jeff was stored in memory it would consist of the following:

  • the encoding of the character J followed by
  • the encoding of the character e followed by
  • the encoding of the character f followed by
  • the encoding of the character f

In other words it would consist of the following:

  • the encoding of the code point U+004A followed by
  • the encoding of the code point U+0065 followed by
  • the encoding of the code point U+0066 followed by
  • the encoding of the code point U+0066
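
The two lists above can be reproduced with a few lines of PHP. This is a sketch that assumes the mbstring extension and PHP 7.4 or later (for mb_str_split()).

```php
<?php
$name = 'Jeff';

// Collect the code point label of each character in turn.
$labels = [];
foreach (mb_str_split($name, 1, 'UTF-8') as $char) {
    $labels[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

echo implode(' ', $labels), "\n";  // U+004A U+0065 U+0066 U+0066

// The bytes actually stored, in hexadecimal. All four characters are below
// U+0080, so each one occupies exactly one byte.
$storedBytes = bin2hex($name);
echo $storedBytes, "\n";           // 4a656666
```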

So, what remains to be explained is how each of the code points is encoded. Here is an outline:

  • Code points U+00000000 to U+0000007f fit in one byte.
  • Code points U+00000080 to U+000007ff fit in two bytes.
  • Code points U+00000800 to U+0000ffff fit in three bytes.
  • Code points U+00010000 to U+001fffff fit in four bytes.
  • Code points U+00200000 to U+03ffffff fit in five bytes.
  • Code points U+04000000 to U+7fffffff fit in six bytes.
  • The five-byte and six-byte forms come from the original UTF-8 design. The current standard (RFC 3629) stops at U+0010ffff, so valid UTF-8 never needs more than four bytes.
  • For code points U+00000000 to U+0000007f the first byte starts with 0.
  • For code points U+00000080 to U+000007ff the first byte starts with 110.
  • For code points U+00000800 to U+0000ffff the first byte starts with 1110.
  • For code points U+00010000 to U+001fffff the first byte starts with 11110.
  • For code points U+00200000 to U+03ffffff the first byte starts with 111110.
  • For code points U+04000000 to U+7fffffff the first byte starts with 1111110.
  • Let us call bytes after the first byte “continuation bytes”.
  • Continuation bytes start with 10; in contrast, the first byte never has 10 as its two most-significant bits. As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF‑8 stream represents the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a byte sequence.
  • The remaining bits, read from left to right across the bytes, store the binary value of the code point.
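
To make the outline concrete, here is a hand-rolled sketch of the encoding rules, limited to the one-byte to four-byte forms. The function names encode_code_point() and count_characters() are hypothetical; in real code you would normally rely on mbstring instead.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 by hand.
function encode_code_point(int $cp): string
{
    // Surrogates (U+D800 to U+DFFF) are reserved and cannot be encoded.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not an encodable code point');
    }
    if ($cp <= 0x7F) {
        return chr($cp);                             // 0xxxxxxx
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6))                // 110xxxxx
             . chr(0x80 | ($cp & 0x3F));             // 10xxxxxx
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))               // 1110xxxx
             . chr(0x80 | (($cp >> 6) & 0x3F))       // 10xxxxxx
             . chr(0x80 | ($cp & 0x3F));             // 10xxxxxx
    }
    return chr(0xF0 | ($cp >> 18))                   // 11110xxx
         . chr(0x80 | (($cp >> 12) & 0x3F))          // 10xxxxxx
         . chr(0x80 | (($cp >> 6) & 0x3F))           // 10xxxxxx
         . chr(0x80 | ($cp & 0x3F));                 // 10xxxxxx
}

// Hypothetical helper: count characters in valid UTF-8 by counting every
// byte that is NOT a continuation byte (does not start with the bits 10).
function count_characters(string $utf8): int
{
    $count = 0;
    for ($i = 0; $i < strlen($utf8); $i++) {
        if ((ord($utf8[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo bin2hex(encode_code_point(0x0046)), "\n";   // 46
echo bin2hex(encode_code_point(0x00E9)), "\n";   // c3a9
echo bin2hex(encode_code_point(0x20AC)), "\n";   // e282ac
echo bin2hex(encode_code_point(0x1F600)), "\n";  // f09f9880
echo count_characters("caf\xC3\xA9"), "\n";      // 4
```

The count_characters() helper shows why the 10 prefix is useful: you can find character boundaries without decoding anything.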

About samehramzylabib

See About on https://samehramzylabib.wordpress.com