Unicode and Character Sets

April 12, 2021

Overview

History of Unicode spans multiple versions, but the latest and most popular is UTF-8.
UTF-16 is the only web-encoding incompatible with ASCII, and never gained popularity on the web, where it is used by under 0.005% (less than 1 hundredth of 1 percent) of web pages. UTF-8, by comparison, is used by 97% of all web pages.
UTF-8 is compatible with ASCII so it’s the same for the first 128 characters. Therefore representing a ASCII compatible character in UTF-8 the byte’s first bit would always be 0, leaving remaining 7 for use. If outside ASCII representable characters multiple bytes would be required.

Example with multiple bytes required.

  Char. number range   |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal.

Example Hello which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

UTF-8

UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. Every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII.

Other ways of encoding Unicode

There are actually a bunch of other ways of encoding Unicode. There’s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There’s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn’t be so bold as to waste that much memory.

Unknown code point �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.

Encoding

How do you specify which encoded you are content is written, well a content type, html pages have a similar thing with the meta tag in the head.

Content-Type: text/plain; charset="UTF-8"