UTF-16 and the Surrogate Mechanism

That brings us to UTF-16. UTF-16 is the oldest Unicode encoding form, although its name goes back only a few years.

UTF-16 maps the 21-bit abstract code point values to sequences of 16-bit code units. For code point values in the BMP, which represent the vast majority of characters in any typical written document, this is a straightforward mapping. Every BMP code point fits in 16 bits, so the top five bits of its 21-bit value are always zero. You just lop those five zero bits off the top, and the 16 bits that remain are the code unit, as shown in Figure 6.1.

Figure 6.1. UTF-16 mapping for BMP characters
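
The BMP mapping is simple enough to sketch in a few lines of Python. The helper name bmp_to_code_unit and the constant BMP_MAX are made up for this illustration and are not part of any library. The sketch covers only the BMP case and simply rejects anything above U+FFFF. As a sanity check, it compares each result against Python's built-in utf-16-be codec.

```python
BMP_MAX = 0xFFFF  # highest code point in the BMP (illustrative constant)


def bmp_to_code_unit(code_point: int) -> int:
    """Return the single 16-bit UTF-16 code unit for a BMP code point.

    Hypothetical helper for illustration only; it does not handle
    supplementary-plane code points.
    """
    if not 0 <= code_point <= BMP_MAX:
        raise ValueError(f"U+{code_point:04X} is outside the BMP")
    # The top five bits of the 21-bit value are zero for every BMP
    # code point, so keeping the low 16 bits loses nothing.
    return code_point & 0xFFFF


for ch in "A€あ":
    unit = bmp_to_code_unit(ord(ch))
    # Cross-check against Python's built-in big-endian UTF-16 encoder.
    assert unit == int.from_bytes(ch.encode("utf-16-be"), "big")
    print(f"U+{ord(ch):04X} -> 0x{unit:04X}")
```

In other words, for a BMP character the code unit and the code point have the same numeric value; only the width of the container changes.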

For characters from the supplementary planes, the transformation is more complicated. To represent supplementary-plane characters, Unicode sets aside 2,048 ...
