How UTF-16 Encoding Works
This post explains exactly how UTF-16 encodes Unicode code points into bytes. If you haven’t read the history of how we got here, see The History of Text Encoding.

From UCS-2 to UTF-16

Originally, Unicode was designed to fit in 16 bits: the Basic Multilingual Plane (BMP), covering code points U+0000 to U+FFFF. The encoding UCS-2 simply stored each code point as a 16-bit integer. When Unicode expanded beyond 65,536 code points (adding emoji, historical scripts, rare CJK characters, etc.), UCS-2 couldn’t represent the new code points. UTF-16 was created as a backward-compatible extension using surrogate pairs.
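To make the surrogate-pair idea concrete before going further, here is a minimal sketch of how a single code point maps to UTF-16 code units. The function name and the example code point are illustrative, not taken from the post: BMP code points become one 16-bit unit (exactly as in UCS-2), and supplementary code points become a high/low surrogate pair.

```python
def encode_utf16_code_units(code_point: int) -> list[int]:
    """Sketch: encode one Unicode code point into UTF-16 code units."""
    if 0xD800 <= code_point <= 0xDFFF:
        # The surrogate range is reserved and has no standalone encoding
        raise ValueError("surrogate code points cannot be encoded on their own")
    if code_point <= 0xFFFF:
        # BMP: a single 16-bit code unit, just like UCS-2
        return [code_point]
    # Supplementary planes (U+10000..U+10FFFF): subtract 0x10000 and
    # split the remaining 20 bits into two 10-bit halves
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate: D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate: DC00..DFFF
    return [high, low]

# Example: U+1F600 (grinning face emoji) -> ['0xd83d', '0xde00']
print([hex(u) for u in encode_utf16_code_units(0x1F600)])
```

...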