
200
|
Chapter 4: Encoding Methods
as an ordered sequence of bytes. For encoding forms that make use of code units that leap
beyond the single byte, such as UTF-16 and UTF-32, byte order is absolutely critical.
The Unicode encodings that are affected by this byte order issue can (and should) make
use of the important BOM, covered in the previous section, in order to explicitly indicate
the byte order of the file. Incorrect interpretation of byte order can lead to some fairly
“amusing” results. For example, consider U+4E00, the very first ideograph in Unicode’s
CJK Unified Ideographs URO. Its big-endian byte order is <4E 00>, and if it were to be
reversed to become <00 4E>, it would be treated as though it were the uppercase Latin
character “N” (U+004E or ASCII 0x4E). This is not what I would consider to be desired
behavior….
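The byte order mix-up described above is easy to reproduce. What follows is a minimal sketch in Python, assuming its standard "utf-16-be", "utf-16-le", and "utf-16" codecs; the variable names are mine and purely illustrative.

```python
# U+4E00 is the first ideograph in the CJK Unified Ideographs URO.
ideograph = "\u4e00"

# Encode as big-endian UTF-16 without a BOM: the bytes are 4E 00.
big_endian = ideograph.encode("utf-16-be")

# Misread those same two bytes as little-endian: the code unit becomes 0x004E.
misread = big_endian.decode("utf-16-le")

# Equivalently, reverse the bytes to 00 4E and read them as big-endian.
swapped = bytes(reversed(big_endian))
as_latin = swapped.decode("utf-16-be")

# Prefix the big-endian BOM (FE FF) and let the generic "utf-16" codec
# work out the byte order on its own; it also strips the BOM.
with_bom = b"\xfe\xff" + big_endian
recovered = with_bom.decode("utf-16")

print(big_endian.hex(), repr(misread), repr(as_latin), repr(recovered))
```

Both the misread and the byte-swapped versions come back as the Latin character "N", while the copy that carries a BOM decodes to the original ideograph.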
BMP Versus Non-BMP
Unicode was originally designed to be represented using only 16-bit code units. This is
equivalent to and fully contained in what is now referred to as Plane 0 or the BMP. The
now obsolete and deprecated UCS-2 encoding represented a pure 16-bit form of Unicode.
Unicode version 2.0 introduced the UTF-16 encoding form, which is UCS-2 encoding
with an extension mechanism that allows 16 additional planes of 65,536 code points each
to be encoded. The code points for these 16 additional planes, because they are outside or
beyond the BMP, are considered to ...
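The arithmetic behind the UTF-16 extension mechanism can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a replacement for a real codec: the mechanism is the surrogate pair, which the text above has not yet named, and the function name to_utf16_code_units and the two constants are hypothetical names of my own.

```python
BMP_SIZE = 0x10000          # 65,536 code points in Plane 0 (the BMP)
SUPPLEMENTARY_PLANES = 16   # the additional planes that UTF-16 can reach


def to_utf16_code_units(code_point: int) -> list:
    """Return the 16-bit UTF-16 code units for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < BMP_SIZE:
        # BMP code points are represented by one code unit, as in UCS-2.
        return [code_point]
    # Beyond the BMP: split the 20-bit offset into two 10-bit halves.
    offset = code_point - BMP_SIZE
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


# 16 planes of 65,536 code points each lie beyond the BMP.
supplementary_total = SUPPLEMENTARY_PLANES * BMP_SIZE

print([hex(unit) for unit in to_utf16_code_units(0x4E00)])
print([hex(unit) for unit in to_utf16_code_units(0x20000)])
print(supplementary_total)
```

A BMP code point such as U+4E00 stays a single 16-bit code unit, while U+20000, the first code point of Plane 2, becomes a pair of code units.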