
Chapter 9: Information Processing Techniques
mechanism for the UTF-16 encoding form. The High and Low Surrogates should never
be handled individually when converting to and from the UTF-8 and UTF-32 encoding
forms, and instead should be used together to map directly to code points in Planes 1
through 16. Table 9-6 illustrates correct and incorrect interpretation of UTF-16 High and
Low Surrogates, using the first characters of Planes 1 and 2, along with the last code point
of Plane 16 (which is classified as a noncharacter) as examples.
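Table 9-6. UTF-16 Surrogate conversion examples
Scalar value  UTF-16BE     UTF-8        UTF-32BE     UTF-8 (Incorrect)  UTF-32BE (Incorrect)
U+10000       D8 00 DC 00  F0 90 80 80  00 01 00 00  ED A0 80 ED B0 80  00 00 D8 00 00 00 DC 00
U+20000       D8 40 DC 00  F0 A0 80 80  00 02 00 00  ED A1 80 ED B0 80  00 00 D8 40 00 00 DC 00
U+10FFFF      DB FF DF FF  F4 8F BF BF  00 10 FF FF  ED AF BF ED BF BF  00 00 DB FF 00 00 DF FF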
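The conversions in Table 9-6 can be reproduced with a short Python sketch. The function names combine_surrogates and table_row are hypothetical, chosen only for this illustration. The sketch assumes the input is exactly one High/Low Surrogate pair in UTF-16BE byte order, and it uses Python's "surrogatepass" error handler solely to imitate the incorrect, per-code-unit UTF-8 conversion.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 High/Low Surrogate pair to a single scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid High/Low Surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def table_row(utf16be: bytes) -> dict:
    """Return correct and incorrect conversions of one UTF-16BE surrogate pair.

    Hypothetical helper: assumes utf16be holds exactly four bytes.
    """
    high = int.from_bytes(utf16be[0:2], "big")
    low = int.from_bytes(utf16be[2:4], "big")
    char = chr(combine_surrogates(high, low))
    return {
        "scalar": ord(char),
        # Correct: the pair is treated as one code point.
        "utf8": char.encode("utf-8"),
        "utf32be": char.encode("utf-32-be"),
        # Incorrect: each surrogate is converted as if it were a character.
        "utf8_incorrect": b"".join(
            chr(unit).encode("utf-8", "surrogatepass") for unit in (high, low)
        ),
        "utf32be_incorrect": b"".join(
            unit.to_bytes(4, "big") for unit in (high, low)
        ),
    }


if __name__ == "__main__":
    for pair in ("D800DC00", "D840DC00", "DBFFDFFF"):
        row = table_row(bytes.fromhex(pair))
        print(f"U+{row['scalar']:04X}")
        for name in ("utf8", "utf32be", "utf8_incorrect", "utf32be_incorrect"):
            print(f"  {name:18} {row[name].hex(' ').upper()}")
```

A correct conversion always yields four bytes in both UTF-8 and UTF-32BE for these code points, whereas the per-code-unit conversion yields six bytes of UTF-8 and eight bytes of UTF-32BE, so a simple length check is enough to tell the two apart.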
Note how the incorrect handling of the UTF-16 High and Low Surrogates results directly
from treating each code unit as a separate character rather than treating the pair as a
single unit. If your software generates UTF-8 and UTF-32 sequences that look like the
UTF-8 (Incorrect) or UTF-32BE (Incorrect) columns in Table 9-6, then something is clearly
wrong. The fact that the result has the wrong number of code units is a good indicator. In the
case of ...