
What Is Unicode?


I used to work in a codebase that was encoded using the Latin-1 (aka ISO 8859-1) encoding. Every so often, somebody’s IDE would automatically re-encode it into UTF-8. If this change wasn’t caught in code review and was deployed to production, all of our translations would go haywire. This caused all kinds of chaos for customers and customer service reps.

Naturally, everybody wanted to switch to UTF-8. After all, Unicode is the modern standard for character encodings. But wait, what even is UTF-8? What is Unicode? Are they different names for the same thing? Isn’t that what the “U” stands for in UTF? What about other UTF encodings like UTF-16? Today I’m going to get to the bottom of this.

History, definitions, and compatibility

ASCII and ISO 8859

ASCII (American Standard Code for Information Interchange) is a character encoding from the 1960s that defines byte representations for 128 characters. Specifically, the bytes 0x00 through 0x7F. In other words, given a byte in this range, ASCII defines which character should be printed. 0x3B represents “;”, 0x76 represents “v”, etc. It can only represent 128 characters because it only uses 7 bits of a byte [1] (\(2^7 = 128\)).
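To make this concrete, here is a quick check in a Python 3 shell (tracebacks abridged), using the same bytes mentioned above:

>>> bytes([0x3B, 0x76]).decode('ascii')
';v'
>>> hex(ord(';')), hex(ord('v'))
('0x3b', '0x76')
>>> b'\x80'.decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)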

What do we do if we want to represent a character that doesn’t fall within the 128 characters defined by ASCII? We need a new character encoding. ISO 8859 is a standard from the 1980s which extends the original ASCII definitions by using the 8th bit. In this way, it can define an additional 128 mappings. It actually contains several different extensions, since 128 additional characters is still not enough for all texts. So depending on what characters you find important, you can choose a character encoding that supports the characters you need. For example, if your language is a Western European language, you probably want to use the encoding ISO 8859-1 (aka Latin-1), which covers characters from English, German, Spanish, Italian, and several other languages. This includes characters such as 0xEA ê, 0xF1 ñ and 0xA3 £.

On the other hand, if you’re writing in Greek, you probably want to use the encoding ISO 8859-7 (aka Latin/Greek). Those same 3 bytes from Latin-1 correspond to κ, ρ and £, respectively.

Byte    ASCII Decoded    Latin-1 Decoded    Latin/Greek Decoded
0x44    D                D                  D
0xA3    N/A              £                  £
0xEA    N/A              ê                  κ
0xF1    N/A              ñ                  ρ
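The rows of that table are easy to reproduce in Python:

>>> b'\xea'.decode('iso-8859-1'), b'\xea'.decode('iso-8859-7')
('ê', 'κ')
>>> b'\xf1'.decode('iso-8859-1'), b'\xf1'.decode('iso-8859-7')
('ñ', 'ρ')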

Can you guess the flaw in this? Imagine what happens if you’re writing a text in say, French or Spanish, with lots of ê or ñ. Then you publish it or send it across a network, and it gets opened by a Greek speaker whose default encoding is Latin/Greek. The part of the text that is the same in both Latin-1 and Latin/Greek encodings will look normal, but anywhere the encodings disagree will look garbled. Example [2]:

>>> latin1_bytes = b'L\'\xe9lectricit\xe9 devra \xeatre r\xe9tablie'
>>> print(latin1_bytes.decode('iso-8859-1'))
L'électricité devra être rétablie
>>> print(latin1_bytes.decode('iso-8859-7'))
L'ιlectricitι devra κtre rιtablie

There is no way to determine the correct encoding from the text alone; this requires some extra metadata to be included. On the web, for example, this is HTTP’s Content-Type header.

❯ curl -sI https://en.wikipedia.org/wiki/ISO/IEC_8859 | grep -i 'content-type:'
content-type: text/html; charset=UTF-8

But what do you do if you want characters in your document from different encodings? Suppose you’re writing a math paper primarily in French, but you need a lot of Greek characters such as π, θ, ψ, etc, for the equations. Well, you’re kind of out of luck if you’re using ASCII or one of the ISO 8859 extensions. You can use ISO 8859-1 for all the French characters (accented Latin letters), or ISO 8859-7 for all the Greek characters, but not both.
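You can see the dead end directly by trying to encode a sentence that mixes accented Latin and Greek letters (the sample string here is my own; tracebacks abridged):

>>> 'théorème: θ = π/2'.encode('iso-8859-1')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b8' in position 10: ordinal not in range(256)
>>> 'théorème: θ = π/2'.encode('iso-8859-7')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 2: character maps to <undefined>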

Unicode and UTF-8

Enter Unicode. Unicode was originally designed in the late 1980s, contemporaneously with the ISO 8859 standard. Unicode defines a bunch of characters and gives each one a number (aka code point). Version 16.0 of the Unicode Standard defines ~155k characters, although the standard has room for up to 1.1M characters. Unicode itself does not define byte encodings, only code points. To get to bytes, you need a Unicode encoding. UTF-8 is one such encoding that defines a mapping from Unicode code point to bytes. There are others as well, such as UTF-16 and UTF-32.
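To make the distinction concrete, here is the Greek letter π (code point U+03C0) and the bytes it becomes under three different Unicode encodings (the -be variants are used just to avoid a byte-order mark):

>>> hex(ord('π'))
'0x3c0'
>>> 'π'.encode('utf-8')
b'\xcf\x80'
>>> 'π'.encode('utf-16-be')
b'\x03\xc0'
>>> 'π'.encode('utf-32-be')
b'\x00\x00\x03\xc0'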

UTF-8 is a variable-length character encoding. One ASCII character is always 1 byte. One UTF-32 character is always 4 bytes. One UTF-8 character, on the other hand, can be anywhere from 1 to 4 bytes. For backwards compatibility with ASCII, it uses a single-byte encoding for code points U+0000 through U+007F (the ASCII range). Since the vast majority of (Western) text is in this range, this also means most UTF-8 encoded text takes up ~25% of the space of the equivalent UTF-32 text. There are similar savings relative to other multi-byte encodings such as UTF-16, which needs 2 bytes for every ASCII-range character.
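A quick illustration of the variable length and of the ~25% figure (which holds for pure ASCII text):

>>> [len(c.encode('utf-8')) for c in 'aé€😀']
[1, 2, 3, 4]
>>> len('hello'.encode('utf-8')), len('hello'.encode('utf-32-be'))
(5, 20)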

How does UTF-8 differ from ISO 8859?

Latin-1 can encode 256 characters and can decode any byte. There are no errors, but there are also no internal check mechanisms: all 256 possible byte values map to a valid character.

UTF-8 can encode all ~1.1M Unicode code points. It maps each code point in the range [U+000000, U+10FFFF] to a sequence of 1 to 4 bytes. The space of possible 1- to 4-byte sequences is far larger than that (\(2^8 + 2^{16} + 2^{24} + 2^{32} \approx 4.3\) billion combinations), and the mapping is not “onto” (i.e. it doesn’t fill the space). Because most byte sequences don’t correspond to any code point, a UTF-8 decoder has a degree of internal consistency/error checking: it can tell when its input is not well-formed UTF-8.
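One way to see the difference is to decode all 256 possible byte values with each codec: Latin-1 happily decodes everything, while UTF-8 gives up as soon as it hits a sequence it doesn’t recognize (traceback abridged):

>>> len(bytes(range(256)).decode('iso-8859-1'))
256
>>> bytes(range(256)).decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte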

There is a great description on Wikipedia of how the encoding works, so I’ll just summarize a few key points. UTF-8 is designed to correspond as much as possible with both ASCII and ISO 8859-1. For the ASCII range (0x00 through 0x7F), UTF-8 is exactly the same. UTF-8 uses the top bit to signal a multi-byte character, so a lone byte in the range 0x80 through 0xFF is never a valid character on its own; a set top bit means the byte is the start or the continuation of a multi-byte sequence.
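For example, “é” (U+00E9) becomes two bytes whose leading bits mark them as the start and the continuation of a multi-byte sequence, while the lone Latin-1 byte 0xE9 is rejected outright (traceback abridged):

>>> [f'{b:08b}' for b in 'é'.encode('utf-8')]
['11000011', '10101001']
>>> b'\xe9'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: unexpected end of data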

This use of the top bit is the major difference that causes incompatibility between ISO 8859 and UTF-8. If you try to decode ISO 8859-1 text using UTF-8, any bytes that don’t form a valid UTF-8 sequence get converted to a replacement character to indicate there is no mapping.

>>> print(latin1_bytes.decode('utf-8', 'replace'))
L'�lectricit� devra �tre r�tablie

If you try to decode UTF-8 text using ISO 8859-1, all bytes are interpreted without error, and the result looks like garbage.

>>> utf8_bytes = b'L\xe2\x80\x99\xc3\xa9lectricit\xc3\xa9 devra \xc3\xaatre r\xc3\xa9tablie'
>>> print(utf8_bytes.decode('utf-8'))
L’électricité devra être rétablie
>>> print(utf8_bytes.decode('iso-8859-1'))
Lâ€™Ã©lectricitÃ© devra Ãªtre rÃ©tablie

It is worth noting, however, that the first two sub-ranges of Unicode’s Latin-1 Supplement block (U+0080 through U+00BF) are encoded in UTF-8 as the byte 0xC2 followed by the original ISO 8859-1 byte. So UTF-8 encoded punctuation and symbols from this range are still mostly readable if decoded using ISO 8859-1; you just pick up a stray “Â”:

>>> b = b'\xc2\xa33.50'
>>> print(b.decode('utf-8'))
£3.50
>>> print(b.decode('iso-8859-1'))
Â£3.50

Summary

Unicode is a standard that assigns a number (a code point, NOT a byte) to each character. A Unicode encoding, such as UTF-8 or UTF-16, assigns a byte sequence to each code point. The ASCII ranges of ISO 8859 and UTF-8 are identical; however, they differ when the top bit is set (0x80 and above). This makes some ISO 8859 byte sequences invalid UTF-8, which results in the replacement character ("�"). On the other hand, every UTF-8 byte is a valid ISO 8859-1 character, so the decode succeeds but the result is jumbled text (such as “é” turning into “Ã©”).

Recommended Resources

Dylan Beattie: Plain Text

Recollection from Rob Pike about how Ken Thompson designed UTF-8 on a placemat at a diner


  1. Why only 7 bits and not 8? Historical reasons.

  2. Source.