Some Properties of UTF-8
As a consequence of the encoding algorithm, the octets appearing in UTF-8 are limited
to certain ranges, as shown in Table 6-2. In particular, octets C0, C1, and F5 through FF
never appear in UTF-8. Other octets may appear in specific contexts only. This means
that if you have a large file that is not, in fact, character data in UTF-8 and you try to
read it as UTF-8, errors will almost certainly be signaled.
Table 6-2. Octet ranges in UTF-8
Code range            Octet 1   Octet 2   Octet 3   Octet 4
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF    80..BF
U+0800..U+0FFF        E0        A0..BF    80..BF
U+1000..U+CFFF        E1..EC    80..BF    80..BF
U+D000..U+D7FF        ED        80..9F    80..BF
U+E000..U+FFFF        EE..EF    80..BF    80..BF
U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF    F4        80..8F    80..BF    80..BF
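The ranges in Table 6-2 can be applied mechanically to test whether a sequence of
octets could be UTF-8. The following Python sketch is an illustration only, not a
reference implementation; the names TABLE_6_2 and first_invalid_offset are invented
for this example.

```python
# Each row of Table 6-2 as a tuple of (low, high) ranges, one per octet.
# The names here are illustrative and do not come from any standard.
TABLE_6_2 = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first octet sequence that matches no row
    of the table, or -1 if the whole input matches."""
    i = 0
    while i < len(data):
        for row in TABLE_6_2:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                low <= octet <= high for octet, (low, high) in zip(chunk, row)
            ):
                i += len(row)  # this row matched; move to the next character
                break
        else:
            return i  # no row matched at this offset
    return -1


print(first_invalid_offset("abc".encode("utf-8")))  # -1: every octet fits the table
print(first_invalid_offset(b"a\xff"))               # 1: FF never appears in UTF-8
```

Because the lead octet ranges of the rows do not overlap, at most one row can match
at any offset, so the order in which the rows are tried does not matter.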
Like UTF-16, UTF-8 makes it impossible to access the nth character of a string
directly, since characters occupy a varying number of code units. UTF-8 is robust,
though: if a code unit is corrupted, the other characters will still be processed
correctly. The reason is that UTF-8 has been designed so that a code unit starting the
representation of a character can be recognized as such, even if the preceding code
unit is in error.
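This property follows from Table 6-2: octets in the range 80..BF appear only in
positions 2 through 4, never as the first octet of a character. A minimal Python
sketch of resynchronizing after damage, using the hypothetical helper names
is_continuation and next_start, might look like this:

```python
def is_continuation(octet: int) -> bool:
    # Octets 80..BF (bit pattern 10xxxxxx) never start a character.
    return octet & 0xC0 == 0x80


def next_start(data: bytes, i: int) -> int:
    """Return the offset of the first octet at or after i that is not a
    continuation octet, or len(data) if there is none."""
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


# "a", U+00E9, U+20AC, "b" encode as the octets 61 C3 A9 E2 82 AC 62.
data = bytearray("a\u00e9\u20acb".encode("utf-8"))
data[1] = 0xFF                       # corrupt the first octet of the second character
resume = next_start(bytes(data), 2)  # skip the orphaned continuation octet A9
tail = bytes(data[resume:]).decode("utf-8")
print(resume, tail == "\u20acb")     # the characters after the damage are intact
```

Only the damaged character is lost; decoding can resume at the next octet that is
able to start a character.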
The authoritative definition of UTF-8 is in the Unicode standard, and its content is as
described here. UTF-8 is also described in an Internet standard, STD 63, which is
currently RFC 3629, “UTF-8, a transformation format of ISO 10646,” available at
http://www.ietf.org/rfc/rfc3629.txt. The RFC contains additional recommendations (by
the IETF) regarding the use of UTF-8 on the Internet, especially with regard to protocol
design.
Byte Order
A unit that consists of two or four octets, such as the code units in UTF-16 and UTF-32,
has a logical order of octets. For example, if you interpret a two-octet unit as a single
unsigned integer (in the range 0..FFFF in hexadecimal, 0..65,535 in decimal), one of
the octets is treated as more significant than the other.
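As a small illustration of the logical order, the Python sketch below combines two
octets into one unsigned integer. The function name unit_value is made up for this
example; the arithmetic is simply a shift by eight bits.

```python
def unit_value(more_significant: int, less_significant: int) -> int:
    """Combine two octets (each 0..FF) into one unsigned integer (0..FFFF)."""
    return (more_significant << 8) | less_significant


a, b = 0x01, 0x02
print(hex(unit_value(a, b)))   # 0x102: a is treated as the more significant octet
print(hex(unit_value(b, a)))   # 0x201: b is treated as the more significant octet
print(unit_value(0xFF, 0xFF))  # 65535: the largest two-octet value
```

The same two octets yield different integers depending on which one is treated as
more significant, which is why the order has to be agreed upon.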
Strange as it may sound, the physical order of octets within a unit may differ from their
logical order. This might be compared to storing a string like “42” so that “2” appears
first in storage, then “4.” Specifically, the physical order of octets in a two-octet unit
