ciple, since UCS-4 operated on a 31-bit coding space, UTF-32 on a 21-bit coding space.
The decision to stick to a 21-bit coding space removed the distinction. The difference is
now nominal, and it is more natural to use the name UTF-32.
UTF-16 and UCS-2
The UCS-2 and UTF-16 encodings use 16-bit code units. In these encodings, all characters
in the Basic Multilingual Plane (BMP), and hence most characters that people use these
days, are represented directly: each character is one code unit, which holds the code
number of the character as an unsigned 16-bit integer. For such characters, the
encodings are structurally simpler than UTF-8.
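The direct representation can be checked with a few lines of Python. This is only an illustrative sketch: the helper name utf16_code_units is made up for this example, and it relies on Python's built-in "utf-16-be" codec (big-endian, no byte order mark) to produce the code units.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    # Every code unit occupies exactly two bytes.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Sample BMP characters: Latin, accented Latin, euro sign, CJK ideograph.
for ch in ["A", "\u00e9", "\u20ac", "\u4e2d"]:
    units = utf16_code_units(ch)
    # For a BMP character the single code unit equals the code number.
    print("U+%04X ->" % ord(ch), ["%04X" % u for u in units])
```

For each of these BMP characters, the list contains a single code unit whose value is the same as the code number.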
UCS-2 Is BMP Only
UCS-2 is by definition limited to BMP. It is therefore not a full Unicode encoding: you
cannot represent all Unicode data in UCS-2. On the other hand, UTF-16 is basically
UCS-2 enhanced with a mechanism (surrogate pairs) for representing Unicode char-
acters outside BMP. If you don’t use such characters, UTF-16 effectively behaves as
UCS-2.
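The relationship can be illustrated with a small Python sketch. Python has no UCS-2 codec, so the check below simply tests whether every code point lies in the BMP; the function names is_bmp_only and utf16_length are hypothetical, chosen for this example only.

```python
def is_bmp_only(text):
    """True if every code point in text is in the BMP (at most FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf16_length(text):
    """Number of 16-bit code units that text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_text = "na\u00efve \u20ac"   # seven BMP characters
clef = "\U0001D11E"              # MUSICAL SYMBOL G CLEF, outside BMP

# BMP-only text: one code unit per character, as in UCS-2.
print(is_bmp_only(bmp_text), len(bmp_text), utf16_length(bmp_text))

# A character outside BMP needs two code units and has no UCS-2 form.
print(is_bmp_only(clef), len(clef), utf16_length(clef))
```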
Thus, UCS-2 can be regarded as mainly historical. It is, however, still part of the ISO
10646 standard, though not part of the Unicode standard. The registered MIME name of
UCS-2 is ISO-10646-UCS-2.
Surrogate Pairs in UTF-16
UTF-16 uses surrogate pairs to overcome the 16-bit limitation. This means that some
16-bit values have been reserved for use as a high (leading) or low (trailing) value in a
pair of code units. Together these values denote a Unicode character outside BMP. The
word “surrogate” is not very descriptive, and it has caused much confusion; in reality,
the “surrogates” are simply an extension mechanism.
More exactly, a high surrogate is a code unit in the range D800 to DBFF, and a low
surrogate is in the range DC00 to DFFF. We use hexadecimal numbers here without
the “U+” prefix to emphasize that the surrogates are code units, not code points. Two
consecutive surrogate code units together denote one code point, which is outside BMP
—i.e., in the range U+10000 to U+10FFFF.
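The mapping between a surrogate pair and its code point can be sketched in Python as follows. The arithmetic is the commonly used subtract-and-split formulation (each surrogate carries 10 bits of the code point minus 10000 hexadecimal); it may be worded differently from the steps given in the text. The function and constant names are invented for this illustration.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trailing) surrogates


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate code unit into one code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a high surrogate followed by a low surrogate")
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def to_surrogate_pair(code_point):
    """Split a code point outside BMP into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range 10000 to 10FFFF")
    offset = code_point - 0x10000          # a 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


high, low = to_surrogate_pair(0x1D11E)      # MUSICAL SYMBOL G CLEF
print("%04X %04X" % (high, low))            # D834 DD1E
print("U+%X" % from_surrogate_pair(high, low))
```

The smallest pair, D800 DC00, corresponds to U+10000, and the largest, DBFF DFFF, to U+10FFFF, which matches the range stated above.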
Surrogate code units have a defined meaning only when they appear as a pair: a high
surrogate immediately followed by a low surrogate. Any other surrogate code unit is
unpaired; it has no defined meaning and is a data error.
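A check for such errors might look like the following Python sketch. It assumes the data is already available as a list of 16-bit integers; the function name find_unpaired_surrogates is hypothetical, and reporting the positions (rather than, say, raising an exception) is just one possible design.

```python
def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a
    high surrogate + low surrogate pair."""
    errors = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2          # well-formed pair, skip both code units
                continue
            errors.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            errors.append(i)
        i += 1
    return errors


print(find_unpaired_surrogates([0x0041, 0xD834, 0xDD1E, 0x0042]))  # well formed
print(find_unpaired_surrogates([0xDD1E, 0xD834]))                  # wrong order
```

In the second call, both code units are reported, since a low surrogate followed by a high surrogate does not form a pair.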
A surrogate code unit pair is constructed by the following algorithm:
1. Given a Unicode code point outside BMP—i.e., with value > FFFF—represent it
as a 21-bit integer, with leading zeros as necessary.
304 | Chapter 6: Unicode Encodings
