CESU-8
Draft Unicode Technical Report #26 proposes an encoding scheme with the rather unwieldy name of “Compatibility Encoding Scheme for UTF-16: 8-bit,” or “CESU-8” for short. CESU-8 is a variant of UTF-8 that encodes the BMP characters exactly as UTF-8 does, but handles the supplementary-plane characters differently. In CESU-8, each supplementary-plane character is represented with a six-byte sequence (two three-byte sequences, one for each UTF-16 surrogate) instead of a four-byte sequence. In other words, the six-byte representation of supplementary-plane characters that is now illegal in UTF-8 is the preferred representation of these characters in CESU-8.
CESU-8 is what you get if you convert a sequence of Unicode code point values to UTF-16 and then convert the UTF-16 code units to UTF-8 code ...
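The conversion just described can be illustrated with a minimal Python sketch. The function name `cesu8_encode` is hypothetical and is not part of any standard library; the sketch assumes Python 3 and relies on the `surrogatepass` error handler so that each surrogate code unit can be written out as a three-byte sequence on its own.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, not a library API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Step 1: split the code point into a UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            # Step 2: encode each surrogate code unit as its own
            # three-byte sequence, giving six bytes in total.
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)
```

For example, U+10400 becomes the surrogate pair D801 DC00 in UTF-16, so `cesu8_encode("\U00010400")` yields the six bytes ED A0 81 ED B0 80, whereas UTF-8 encodes the same character in the four bytes F0 90 90 80.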