Draft Unicode Technical Report #26 proposes an encoding scheme with the rather unwieldy name of “Compatibility Encoding Scheme for UTF-16: 8-bit,” or “CESU-8” for short. CESU-8 is a variant of UTF-8 that treats BMP characters exactly as UTF-8 does but handles supplementary-plane characters differently. In CESU-8, a supplementary-plane character is represented with a six-byte sequence (two three-byte sequences, one for each UTF-16 surrogate) instead of the four-byte sequence that UTF-8 uses. In other words, the six-byte representation of supplementary-plane characters that is now illegal in UTF-8 is the preferred representation of these characters in CESU-8.

CESU-8 is what you get if you convert a sequence of Unicode code point values to UTF-16 and then convert the UTF-16 code units to UTF-8 code ...
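The two-step conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `cesu8_encode` is hypothetical, and the sketch assumes the input string contains no unpaired surrogates (Python's `utf-16-be` codec raises an error if it does).

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, hypothetical helper).

    Step 1: convert the code points to UTF-16 code units.
    Step 2: encode each 16-bit code unit with the UTF-8 bit layout,
            as if the code unit were a code point in its own right.
    """
    # Big-endian UTF-16 without a byte order mark: two bytes per code unit.
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if unit < 0x80:
            # One byte: 0xxxxxxx
            out.append(unit)
        elif unit < 0x800:
            # Two bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            # Surrogate code units (0xD800-0xDFFF) land here, so a
            # surrogate pair becomes 3 + 3 = 6 bytes.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


# BMP characters come out identical to UTF-8.
print(cesu8_encode("A\u00e9\u20ac").hex(" "))
# A supplementary-plane character takes six bytes instead of four.
print(cesu8_encode("\U00010400").hex(" "))
print("\U00010400".encode("utf-8").hex(" "))
```

For U+10400, whose UTF-16 form is the surrogate pair D801 DC00, the sketch yields the bytes ED A0 81 ED B0 80, whereas UTF-8 encodes the same character as F0 90 90 80.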
