Short Discourse on Unicode

Introduction

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not actually correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

In fact, Unicode makes you think a different way about encoding characters. Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

The Depth of Unicode

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is another story.

Every platonic letter in every alphabet is assigned a magic number by the Unicode Consortium, which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041.
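To make the notation concrete, here is a minimal Python sketch that turns a character into its U+ form and back. The helper name `code_point_label` is made up for this illustration; `ord` and `chr` are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; format it as at least 4 hex digits.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))       # U+0041
print(code_point_label("\u0639"))  # U+0639, the Arabic letter Ain

# Going the other way: from the number back to the abstract character.
print(chr(0x0041))                 # A
```

Note that nothing here says how the character is stored. The code point is only a number.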

Encoding

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was: let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

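Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00

It could. Whether the high byte or the low byte of each pair comes first depends on the machine, so there were already two ways to store the same text.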

So people were forced to come up with the bizarre convention of storing FE FF at the beginning of every Unicode string. This is called a Unicode Byte Order Mark, and if you are swapping your high and low bytes it will look like FF FE, which tells the reader to swap every other byte.
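The two byte orders and the mark can be reproduced in a few lines of Python. This is only a sketch: it uses Python's `utf-16-be` and `utf-16-le` codecs as a stand-in for the two-byte scheme described above (for plain letters like these the bytes are the same), and the helper names `hex_bytes` and `sniff_byte_order` are invented for the example.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)


def sniff_byte_order(data: bytes) -> str:
    """Guess the byte order from a leading Byte Order Mark, if there is one."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown"


text = "Hello"
big = text.encode("utf-16-be")      # high byte first
little = text.encode("utf-16-le")   # low byte first

print(hex_bytes(big))     # 00 48 00 65 00 6C 00 6C 00 6F
print(hex_bytes(little))  # 48 00 65 00 6C 00 6C 00 6F 00

# Prepending the mark lets a reader tell the two apart.
print(sniff_byte_order(codecs.BOM_UTF16_BE + big))     # big-endian
print(sniff_byte_order(codecs.BOM_UTF16_LE + little))  # little-endian
print(sniff_byte_order(big))                           # unknown
```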

For a while it seemed like that might be good enough, but American programmers were complaining. English text rarely uses code points above U+00FF, and the thought of all those wasted zero bytes shocked them. Besides, who was going to convert all the existing ASCII documents over?

Thus was invented UTF-8. UTF-8 is another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed up to 6 bytes, but UTF-8 has since been restricted to 4, which is enough to reach the highest Unicode code point, U+10FFFF.)

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F.
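A short Python sketch shows both halves of that claim: Hello comes out byte-for-byte identical to ASCII, while characters outside the 0-127 range take more bytes. The sample characters and the helper name `utf8_hex` are chosen only for illustration.

```python
def utf8_hex(text: str) -> str:
    """Encode text as UTF-8 and show the bytes as hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


print(utf8_hex("Hello"))  # 48 65 6C 6C 6F

# For pure ASCII text the UTF-8 bytes and the ASCII bytes are the same.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")

# Code points above 127 need more than one byte each.
samples = ["A", "\u00e9", "\u0639", "\u20ac", "\U0001F600"]
for ch in samples:
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {size} byte(s): {utf8_hex(ch)}")
```

The loop prints one line per sample character, giving its code point, its byte count, and the bytes themselves.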

There are actually loads of other encodings out there, but since December 2007 the most popular encoding on the web has been UTF-8, so it is the one worth knowing most about. It’s important to know the encoding of any message sent over the web. There is no such thing as plain text anymore!

Content Types

For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"
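As a rough illustration, Python's standard `email` package will write that header for you when you state the charset. The subject line and body text below are placeholders, not anything from a real message.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Encoding test"  # placeholder subject

# Declaring the charset here is what produces the charset parameter in the header.
msg.set_content("Ol\u00e1, mundo", subtype="plain", charset="utf-8")

print(msg["Content-Type"])        # the header line a mail client would read
print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # utf-8
```

A receiving program can call `get_content_charset()` in the same way to find out how to decode the body.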

For a web page, the original idea was that the web server would return a similar Content-Type HTTP header along with the web page itself: not in the HTML itself, but as one of the response headers that are sent before the HTML page.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this seems to be a Catch-22: “how can you read the HTML file until you know what encoding it’s in?!” Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section, because as soon as the web browser sees this tag it’s going to stop parsing the page and start over, reinterpreting the whole page using the encoding you specified.
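The trick of reading the start of the page before knowing its encoding can be sketched in Python. This is a deliberate simplification and not what real browsers do (their detection rules are far more elaborate). The function name `sniff_meta_charset`, the 1024-byte window, and the fallback to UTF-8 are all assumptions made for the example.

```python
import re

# Match a charset declaration inside a <meta> tag, working on raw bytes.
# This works because the tag itself uses only characters in the 32-127 range.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Look for a declared charset near the top of an HTML document (sketch)."""
    match = META_CHARSET.search(raw[:1024])  # assumed window: first 1024 bytes
    if match:
        return match.group(1).decode("ascii").lower()
    return default  # assumed fallback when nothing is declared


page = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b'<title>caf\xc3\xa9</title></head></html>'
)

encoding = sniff_meta_charset(page)
print(encoding)               # utf-8
print(page.decode(encoding))  # the whole page, reinterpreted with that encoding
```

This mirrors the "start over" step described above: scan the bytes first, then decode the whole page once the encoding is known.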

Conclusion

What if a poorly informed web designer doesn’t include this tag, though? Apparently, every browser does something different to try to guess the character set. Sometimes they get it right. Sometimes they get it wrong, and then you, the person browsing the web, have to try to figure out what encoding the web designer actually meant for you to use. Is it Chinese? Hindi? Arabic? Good luck!
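What a wrong guess looks like is easy to reproduce. In this small Python sketch the same bytes are decoded twice, once with the encoding they were written in and once with a wrong one. The sample text is arbitrary, and Latin-1 merely stands in for whatever a browser might guess.

```python
original = "Na\u00efve caf\u00e9"  # "Naïve café"
raw = original.encode("utf-8")     # what the web designer actually saved

right = raw.decode("utf-8")        # correct guess
wrong = raw.decode("latin-1")      # wrong guess: every byte becomes one character

print(right)  # Naïve café
print(wrong)  # NaÃ¯ve cafÃ©
```

No error is raised in the wrong case. The text just quietly turns into garbage, which is why the encoding has to be declared rather than guessed.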

Thus the central point: for every web document you code up, be sure to declare its encoding!
