
CHAPTER 6
Unicode Encodings
This chapter describes UTF-8 and other encodings for Unicode in detail, including the
algorithmic descriptions and the practical considerations in choosing an encoding. It
concentrates on the UTF-8, UTF-16, and UTF-32 encodings, which are the current
official Unicode encodings. However, some older encodings are described as well, even
though not all of them are formally character encodings in a strict sense.
If you are not interested in the technicalities of encodings, you might read just the last
section of this chapter (“Choosing an Encoding”). It summarizes the practical criteria,
but they can really be understood well only if you know the technical foundations.
Unicode Encodings in General
As described in Chapter 3, an encoding is a mapping from code numbers (which
represent characters) to sequences of code units. A code unit is in practice an octet (8-bit
byte), a double octet (16-bit quantity), or a quadruple octet (32-bit quantity). The
reason for using such units is that modern computers have been designed to work on such
data objects efficiently.
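To make the notion of a code unit concrete, the following minimal sketch counts how many code units a string occupies in three encodings. It uses Python's built-in codecs; the helper name code_unit_count and the sample characters are chosen here for illustration only and do not come from the text.

```python
def code_unit_count(text: str, encoding: str, unit_size: int) -> int:
    """Return the number of code units in the encoded form of text.

    unit_size is the size of one code unit in octets:
    1 for UTF-8, 2 for UTF-16, 4 for UTF-32.
    """
    # The "-be" codec variants are used so that no byte order mark
    # is prepended and counted by mistake.
    return len(text.encode(encoding)) // unit_size


euro = "\u20ac"        # EURO SIGN, U+20AC
clef = "\U0001d11e"    # MUSICAL SYMBOL G CLEF, U+1D11E

for name, ch in (("U+20AC", euro), ("U+1D11E", clef)):
    print(
        name,
        code_unit_count(ch, "utf-8", 1),      # octets
        code_unit_count(ch, "utf-16-be", 2),  # double octets
        code_unit_count(ch, "utf-32-be", 4),  # quadruple octets
    )
```

A single character may thus take a different number of code units depending on the encoding, and only with quadruple octets is that number always one.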
Thus, the simplest encoding for Unicode is to map each code number to a quadruple
octet representing the number as a single integer in binary notation. Such an encoding,
UTF-32, is however too wasteful of space for most practical purposes, since every
character takes four octets.
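The mapping just described can be sketched in a few lines of Python. This is an illustration only, not a reference implementation: the function name utf32_be_encode is hypothetical, and the octets are written most significant first (big-endian), which is an assumption made here because the text has not yet fixed an octet order.

```python
import struct


def utf32_be_encode(text: str) -> bytes:
    """Encode text by writing each code number as one 32-bit integer.

    Assumption: big-endian octet order, with no byte order mark.
    """
    # ord() yields the code number of a character; ">I" packs it as a
    # big-endian unsigned 32-bit integer, i.e. one quadruple octet.
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


encoded = utf32_be_encode("A\u20ac")
print(encoded.hex())  # 00000041000020ac

# The result agrees with Python's built-in big-endian UTF-32 codec.
assert encoded == "A\u20ac".encode("utf-32-be")

# Plain ASCII text takes four times as many octets as in UTF-8.
print(len(utf32_be_encode("hello")), len("hello".encode("utf-8")))  # 20 5
```

The leading zero octets in the output show where the space goes: for most text, the high-order octets of each quadruple carry no information.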
Within a code unit of 16 or 32 bits, the order in which the octets are interpreted ...