Unicode Demystified by Richard Gillam

UTF-16 and the Surrogate Mechanism

That brings us to UTF-16. UTF-16 is the oldest Unicode encoding form, although its name goes back only a few years.

UTF-16 maps the 21-bit abstract code point values to sequences of 16-bit code units. For code point values in the BMP, which represent the vast majority of characters in any typical written document, this is a straightforward mapping. You just lop the five zero bits off the top, as shown in Figure 6.1.

Figure 6.1. UTF-16 mapping for BMP characters
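As a rough illustration of the BMP case described above, the following Python sketch treats a code point as a 21-bit value and keeps only its low 16 bits. The function name `bmp_to_utf16` is hypothetical and chosen only for this example. It covers BMP values alone and rejects everything else, since supplementary-plane characters need the more complicated transformation.

```python
def bmp_to_utf16(code_point: int) -> int:
    """Return the single 16-bit UTF-16 code unit for a BMP code point.

    Hypothetical helper for illustration only: it covers just the
    straightforward BMP mapping, not supplementary-plane characters.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError(
            f"U+{code_point:04X} is outside the BMP; "
            "it cannot be mapped to a single 16-bit code unit"
        )
    # The top five bits of the 21-bit value are all zero for BMP code
    # points, so masking to 16 bits drops them without losing information.
    return code_point & 0xFFFF


if __name__ == "__main__":
    euro = 0x20AC  # U+20AC EURO SIGN, a BMP character
    unit = bmp_to_utf16(euro)
    print(format(euro, "021b"))  # the code point viewed as 21 bits
    print(format(unit, "016b"))  # the same value as one 16-bit code unit
```

For BMP characters the result should agree with Python's built-in encoder: `"\u20ac".encode("utf-16-be")` yields the same two bytes as `bmp_to_utf16(0x20AC).to_bytes(2, "big")`.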

For characters from the supplementary planes, the transformation is more complicated. To represent supplementary-plane characters, Unicode sets aside 2,048 ...
