How encodings work

UTF-8

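Variable-length: 1 byte for ASCII, 2 for accented Latin, Greek, Cyrillic, Hebrew and Arabic, 3 for the rest of the Basic Multilingual Plane (including most CJK), and 4 for code points above U+FFFF (most emoji and rare characters). The leading bits of each byte indicate its role: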

  • 0xxxxxxx -- single byte (ASCII)
  • 110xxxxx 10xxxxxx -- 2 bytes
  • 1110xxxx 10xxxxxx 10xxxxxx -- 3 bytes
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx -- 4 bytes

The fixed leading bits (0, 110, 1110, 11110, and 10 on continuation bytes) are structure; the x bits carry the code point.
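
As a rough illustration of the byte layouts listed above, the sketch below packs a code point into UTF-8 by hand and checks the result against Python's built-in encoder. The function name utf8_encode is made up for this example; it is not part of this tool.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample character per length class: 1, 2, 3 and 4 bytes.
for ch in "Aé中😀":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

Printing the bytes in binary makes the structural prefix of each byte easy to compare with the patterns in the list.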

UTF-16

Variable-length: 2 bytes (one 16-bit code unit) for characters in the Basic Multilingual Plane (U+0000 to U+FFFF), which includes all common CJK characters. Characters above U+FFFF use a surrogate pair: two code units, 4 bytes in total.
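
The surrogate-pair arithmetic can be sketched as follows. This is a minimal illustration rather than this tool's implementation, and the name utf16_code_units is an assumption made for the example.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for one code point (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: the code point is its own code unit.
        return [code_point]
    # Above U+FFFF: subtract 0x10000, then split the remaining 20 bits.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return [high, low]


# A BMP character needs one unit; an emoji above U+FFFF needs two.
assert utf16_code_units(ord("中")) == [0x4E2D]
assert utf16_code_units(ord("😀")) == [0xD83D, 0xDE00]
# Cross-check the pair against Python's big-endian UTF-16 encoder.
assert "😀".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```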

GB2312

A legacy Chinese encoding from 1980. Maps 6,763 simplified Chinese characters, plus several hundred symbols and non-Chinese letters, to 2-byte codes. ASCII characters use 1 byte. This tool shows GB2312 values for a curated set of common characters.
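
For comparison, GB2312 byte values can be looked up with Python's built-in "gb2312" codec. The sketch below assumes that codec rather than this tool's curated table, so its coverage may differ from what the tool shows; gb2312_bytes is a hypothetical helper name.

```python
from typing import Optional


def gb2312_bytes(ch: str) -> Optional[bytes]:
    """Return the GB2312 bytes for ch, or None if GB2312 cannot encode it."""
    try:
        return ch.encode("gb2312")
    except UnicodeEncodeError:
        # The character is outside the GB2312 repertoire.
        return None


assert gb2312_bytes("A") == b"A"            # ASCII stays 1 byte
assert gb2312_bytes("中") == b"\xd6\xd0"    # simplified hanzi: 2 bytes
assert gb2312_bytes("😀") is None           # emoji are not in GB2312

for ch in "A中龙😀":
    encoded = gb2312_bytes(ch)
    print(ch, encoded.hex(" ") if encoded is not None else "not in GB2312")
```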