How encodings work
UTF-8
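Variable-length: 1 byte for ASCII, 2 for accented Latin, Greek, Cyrillic, and other characters up to U+07FF, 3 for the rest of the Basic Multilingual Plane (including most CJK), and 4 for most emoji and rare characters. The leading bits of each byte indicate its role: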
0xxxxxxx -- single byte (ASCII)
110xxxxx 10xxxxxx -- 2 bytes
1110xxxx 10xxxxxx 10xxxxxx -- 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx -- 4 bytes
The fixed prefix bits (0, 110, 1110, 11110, and the 10 that starts each continuation byte) are structure; the x bits carry the code point.
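To make the bit layout concrete, here is a minimal sketch of a hand-written UTF-8 encoder. The function name `utf8_encode` is hypothetical and not part of this tool; in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper, not the tool's code)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: 7 payload bits
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 payload bits
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    for ch in ["A", "é", "中", "😀"]:
        encoded = utf8_encode(ord(ch))
        print(ch, " ".join(f"{b:08b}" for b in encoded))
```

Printing each byte in binary shows the prefix bits from the table above followed by the payload bits of the code point.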
UTF-16
Variable-length: 2 bytes (one 16-bit code unit) for characters in the Basic Multilingual Plane (U+0000 to U+FFFF), which includes all common CJK. Characters above U+FFFF, such as most emoji, use a surrogate pair (two code units, 4 bytes).
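The surrogate-pair calculation can be sketched as follows. The helper name `utf16_code_units` is hypothetical; it returns 16-bit code units rather than bytes, so byte order (UTF-16LE vs. UTF-16BE) is left out of the picture.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for one code point (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        # BMP character: a single 16-bit code unit
        return [cp]
    # Supplementary character: subtract 0x10000 to get a 20-bit value,
    # then split it into a high half and a low half of 10 bits each.
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high (leading) surrogate, D800..DBFF
    low = 0xDC00 | (v & 0x3FF)   # low (trailing) surrogate, DC00..DFFF
    return [high, low]


if __name__ == "__main__":
    for ch in ["中", "😀"]:
        units = utf16_code_units(ord(ch))
        print(ch, " ".join(f"{u:04X}" for u in units))
```

For U+1F600 this gives the pair D83D DE00, which matches what `"😀".encode("utf-16-be")` produces.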
GB2312
A legacy Chinese encoding from 1980. Maps 6,763 simplified Chinese characters, plus several hundred symbols and non-Chinese letters, to 2-byte codes. ASCII characters use 1 byte. This tool shows GB2312 values for a curated set of common characters.
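As a rough illustration, Python's standard `gb2312` codec can show the byte values for a character. This is an assumption-laden sketch: it uses Python's full codec table rather than this tool's curated set, and the helper name `gb2312_bytes` is hypothetical.

```python
from typing import Optional


def gb2312_bytes(ch: str) -> Optional[bytes]:
    """Return the GB2312 bytes for one character, or None if it has no mapping."""
    try:
        return ch.encode("gb2312")
    except UnicodeEncodeError:
        # The character is outside the GB2312 repertoire (for example, emoji).
        return None


if __name__ == "__main__":
    for ch in ["A", "中", "😀"]:
        encoded = gb2312_bytes(ch)
        shown = encoded.hex(" ").upper() if encoded is not None else "(not in GB2312)"
        print(ch, shown)
```

ASCII characters come back as a single byte, Chinese characters in the set come back as two bytes, and anything outside the set yields `None`.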