CESU-8

Draft Unicode Technical Report #26 proposes an encoding scheme with the rather unwieldy name of “Compatibility Encoding Scheme for UTF-16: 8-bit,” or “CESU-8” for short. CESU-8 is a variant of UTF-8 that treats BMP characters exactly as UTF-8 does but handles supplementary-plane characters differently: it represents each one with a six-byte sequence (two three-byte sequences, one per UTF-16 surrogate) instead of UTF-8’s four-byte sequence. In other words, the six-byte representation of supplementary-plane characters that is now illegal in UTF-8 is the preferred representation of these characters in CESU-8.

CESU-8 is what you get if you convert a sequence of Unicode code point values to UTF-16 and then convert the UTF-16 code units to UTF-8 code ...
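The two-step conversion described above can be sketched in a few lines of Python. This is only an illustration, not code from the technical report: the function names `encode_utf8_unit` and `cesu8_encode` are hypothetical, and the sketch assumes the input is an ordinary Python string of code points.

```python
def encode_utf8_unit(value: int) -> bytes:
    """Encode one value below 0x10000 with the 1- to 3-byte UTF-8 bit layout.

    Hypothetical helper: unlike a strict UTF-8 encoder, it does not reject
    surrogate values (0xD800-0xDFFF), which is what CESU-8 relies on.
    """
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    return bytes([
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: code points -> UTF-16 code units -> UTF-8-style bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Step 1: split the supplementary-plane code point into a surrogate pair.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            # Step 2: encode each surrogate separately as a three-byte sequence.
            out += encode_utf8_unit(high)
            out += encode_utf8_unit(low)
        else:
            # BMP characters come out exactly as they would in UTF-8.
            out += encode_utf8_unit(cp)
    return bytes(out)


# U+10400 takes six bytes in CESU-8 but four in UTF-8.
print(cesu8_encode("\U00010400").hex(" "))    # ed a0 81 ed b0 80
print("\U00010400".encode("utf-8").hex(" "))  # f0 90 90 80
```

For text that contains only BMP characters the output is byte-for-byte identical to UTF-8; the two encodings diverge only when a supplementary-plane character appears.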
