
ciple, since UCS-4 operated on a 31-bit coding space, UTF-32 on a 21-bit coding space.
The decision to stick to a 21-bit coding space removed the distinction. The difference is
now nominal, and it is more natural to use the name UTF-32.
UTF-16 and UCS-2
The UCS-2 and UTF-16 encodings use 16-bit code units. In these encodings, all characters
in the Basic Multilingual Plane (BMP), and hence most characters that people
use these days, are represented directly: each character is a single code unit, which
holds the code number of the character as one unsigned 16-bit integer. Thus, for BMP
characters, the encodings are structurally simpler than UTF-8.
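As a small illustration of this direct representation, the following Python sketch encodes a few BMP characters as UTF-16 and compares each code unit with the character's code number. The helper name utf16_code_units is made up for this example, and big-endian byte order without a byte order mark is an arbitrary choice.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as unsigned 16-bit integers.

    Illustrative helper (not a standard function); uses big-endian
    byte order and no byte order mark.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# Three BMP characters: LATIN CAPITAL LETTER A, e with acute, EURO SIGN.
bmp_text = "A\u00e9\u20ac"
units = utf16_code_units(bmp_text)
codes = [ord(ch) for ch in bmp_text]

# For BMP characters, each code unit is simply the code number.
print([hex(u) for u in units])
print(units == codes)
```

For text that stays inside the BMP, the list of code units and the list of code numbers are identical, which is what makes processing simpler than with UTF-8.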
UCS-2 Is BMP Only
UCS-2 is by definition limited to the BMP. It is therefore not a full Unicode encoding: you
cannot represent all Unicode data in UCS-2. UTF-16, on the other hand, is basically
UCS-2 enhanced with a mechanism (surrogate pairs) for representing Unicode characters
outside the BMP. If you don’t use such characters, UTF-16 effectively behaves as
UCS-2.
Thus, UCS-2 can be regarded as mainly historical. It is, however, still part of the ISO
10646 standard, though not part of the Unicode standard. The registered MIME name of
UCS-2 is ISO-10646-UCS-2.
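The relationship between the two encodings can be sketched in Python. Python has no built-in UCS-2 codec, so encode_ucs2_be below is a hypothetical stand-in, written for this example only, that rejects any character outside the BMP and otherwise produces the same bytes as UTF-16.

```python
def encode_ucs2_be(text):
    """Hypothetical UCS-2 encoder (big-endian), for illustration only.

    UCS-2 has no way to express characters outside the BMP, so such
    characters are rejected instead of being encoded.
    """
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "U+%04X is outside the BMP and has no UCS-2 form" % ord(ch))
    return text.encode("utf-16-be")

# BMP-only text: UCS-2 and UTF-16 give identical bytes.
bmp_text = "caf\u00e9"
same = encode_ucs2_be(bmp_text) == bmp_text.encode("utf-16-be")

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
outside = "\U0001D11E"
utf16_len = len(outside.encode("utf-16-be")) // 2  # number of code units

try:
    encode_ucs2_be(outside)
    ucs2_ok = True
except ValueError:
    ucs2_ok = False

print(same, utf16_len, ucs2_ok)
```

The non-BMP character takes two code units in UTF-16, whereas the UCS-2 sketch can only refuse it.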
Surrogate Pairs in UTF-16
UTF-16 uses surrogate pairs to overcome the 16-bit limitation. This means that some
16-bit values have been reserved for use as a high (leading) or low (trailing) value in a
pair of code units. ...
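As a rough sketch of how such a pair is formed, the Python code below uses the standard UTF-16 arithmetic: subtract 0x10000 from the code number, then place the upper 10 bits of the result in a high surrogate (range D800 to DBFF) and the lower 10 bits in a low surrogate (range DC00 to DFFF). The function names are invented for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code number outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code numbers above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits -> DC00..DFFF
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a code number."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E MUSICAL SYMBOL G CLEF as an example outside the BMP.
pair = to_surrogate_pair(0x1D11E)
print([hex(u) for u in pair])
print(hex(from_surrogate_pair(*pair)))
```

Because the two surrogate ranges do not overlap with each other or with ordinary BMP characters, a program can tell from a single code unit whether it is a complete character, the first half of a pair, or the second half.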