
On the other hand, the Unicode encodings are defined for noncharacters and for
unassigned code points, too. If some data contains, for example, the code point U+FFFF,
which is defined to be a noncharacter, the data is incorrect as Unicode character data.
However, it is processed in a well-defined way when encoding the data in UTF-8,
UTF-16, or UTF-32. This guarantees that conversions between Unicode encodings do
not silently remove such errors: the offending code point survives the conversion, so
it can still be detected afterward.
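As a minimal sketch of this behavior, the following Python fragment encodes the noncharacter U+FFFF in all three encodings, converts it from UTF-8 to UTF-16, and then detects it. Python is used only for illustration, and the names is_noncharacter and find_noncharacters are hypothetical helpers, not part of any standard library. The sketch assumes Python 3's built-in codecs, which accept noncharacters.

```python
def is_noncharacter(cp):
    """Return True if the code point is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text):
    """Return the positions in text that hold noncharacters."""
    return [i for i, ch in enumerate(text) if is_noncharacter(ord(ch))]


data = "ab\uffffcd"  # incorrect as character data, but encodable

utf8 = data.encode("utf-8")       # U+FFFF becomes the bytes EF BF BF
utf16 = data.encode("utf-16-be")  # U+FFFF becomes the code unit FFFF
utf32 = data.encode("utf-32-be")  # U+FFFF becomes the code unit 0000FFFF

# Converting from UTF-8 to UTF-16 keeps the noncharacter in place.
converted = utf8.decode("utf-8").encode("utf-16-be")

print(find_noncharacters(converted.decode("utf-16-be")))  # [2]
```

The error is neither repaired nor hidden by the conversion. Whether to reject or merely report it is left to the code that inspects the data.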
The encodings UTF-8, UTF-16, and UTF-32 are all self-synchronizing. This feature,
also known as auto-synchronization, means that if malformed data (i.e., data that is
not possible according to the definition of the encoding) is encountered, only one code
point needs to be rejected. The start of the representation of the next code point can
be recognized easily. This helps guard against errors caused by data corruption in
transfer or storage: the effects of errors are local. If you have data like “Foobar” and
the character “b” is corrupted in storage or transfer, the data appears as “Foo?ar”
(where ? indicates corrupted data). In some other encodings, all data following a
corrupted character might appear as corrupted.
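The “Foobar” case can be reproduced with a short Python sketch. It assumes that the corruption turns the byte for “b” into 0xFF, a value that never occurs in well-formed UTF-8; the variable names are illustrative only. The decoder is asked to substitute U+FFFD (the replacement character) for malformed input instead of raising an error.

```python
original = "Foobar".encode("utf-8")

# Simulate corruption in storage or transfer: overwrite the byte
# that represents "b" (index 3) with the invalid value 0xFF.
damaged = bytearray(original)
damaged[3] = 0xFF

# errors="replace" substitutes U+FFFD for the malformed byte and
# resumes decoding at the next byte that can start a code point.
recovered = bytes(damaged).decode("utf-8", errors="replace")

# Show the replacement character as "?" to match the text above.
print(recovered.replace("\ufffd", "?"))  # Foo?ar
```

Only the damaged position is lost; the characters “a” and “r” that follow it are decoded normally.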
Sample program code, in the C language, for conversions between the Unicode
encoding forms is available at http://www.unicode.org/Public/PROGRAMS/CVTUTF/.
UTF-32 and UCS-4
UTF-32 uses a 32-bit ...