
Some Properties of UTF-8
Because of how the encoding algorithm works, the octets appearing in UTF-8 are limited
to certain ranges, as shown in Table 6-2. In particular, octets C0 and C1 and F5 through
FF never appear in UTF-8. Other octets may appear in specific contexts only. This means
that if you have a large file that is not, in fact, character data in UTF-8 and you try
to read it as UTF-8, errors will almost certainly be signaled.
Table 6-2. Octet ranges in UTF-8
Code range            Octet 1   Octet 2   Octet 3   Octet 4
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF    80..BF
U+0800..U+0FFF        E0        A0..BF    80..BF
U+1000..U+CFFF        E1..EC    80..BF    80..BF
U+D000..U+D7FF        ED        80..9F    80..BF
U+E000..U+FFFF        EE..EF    80..BF    80..BF
U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF    F4        80..8F    80..BF    80..BF
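The table can be turned almost mechanically into a validity check. The following Python sketch is only an illustration, not a definitive implementation: the names RANGES and is_well_formed_utf8 are invented here, and a real program would normally rely on the decoder built into its language or library.

```python
# Each row of Table 6-2: the allowed range for octet 1, followed by the
# allowed ranges for the remaining octets of the sequence.
# RANGES and is_well_formed_utf8 are hypothetical names used for illustration.
RANGES = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every octet sequence in data matches a row of the table."""
    i = 0
    n = len(data)
    while i < n:
        # Find the table row whose first-octet range contains data[i].
        for (lo, hi), tail in RANGES:
            if lo <= data[i] <= hi:
                break
        else:
            # C0, C1, F5..FF, or a continuation octet in starting position.
            return False
        if i + len(tail) >= n:
            return False  # the sequence is cut off at the end of the data
        for k, (tlo, thi) in enumerate(tail, start=1):
            if not tlo <= data[i + k] <= thi:
                return False
        i += 1 + len(tail)
    return True
```

For example, is_well_formed_utf8("é€".encode("utf-8")) is true, whereas the octet pair C0 80 and the octet sequence ED A0 80 are both rejected, the first because C0 never appears in UTF-8 and the second because A0 is outside the range 80..9F allowed after ED.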
As with UTF-16, UTF-8 makes it impossible to access the nth character of a string
directly, because a character may occupy from one to four octets. UTF-8 is robust,
though: if a code unit is corrupted, the other characters will still be processed
correctly. The reason is that UTF-8 has been designed so that a code unit starting the
representation of a character can be recognized as such, even if the preceding code
unit is in error.
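Both properties can be illustrated with a short Python sketch. It relies on the fact, visible in Table 6-2, that octets 80..BF appear only in positions after the first, so an octet starts a character exactly when it is outside that range. The function names are invented for this illustration and are not part of any standard library.

```python
def is_continuation(octet: int) -> bool:
    """True for octets 80..BF, which never start a character."""
    return (octet & 0xC0) == 0x80


def next_char_start(data: bytes, pos: int) -> int:
    """Index of the first octet at or after pos that is not a continuation octet."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


def nth_char_offset(data: bytes, n: int) -> int:
    """Octet offset of character number n (counting from 0).

    There is no way to compute this directly; the data must be scanned
    from the beginning. Assumes data is well-formed UTF-8.
    """
    count = -1
    for i, octet in enumerate(data):
        if not is_continuation(octet):
            count += 1
            if count == n:
                return i
    raise IndexError(n)


good = "a€b".encode("utf-8")   # octets 61 E2 82 AC 62
damaged = bytearray(good)
damaged[1] = 0xFF              # corrupt the first octet of the euro sign

# Skipping the orphaned continuation octets finds the start of "b".
resume_at = next_char_start(bytes(damaged), 2)

# Python's own decoder, asked to substitute instead of raising an error,
# still decodes the undamaged characters on both sides.
text = bytes(damaged).decode("utf-8", errors="replace")
```

Here nth_char_offset(good, 2) is 4, since the euro sign occupies three octets, and resume_at is also 4: after the damage, processing can resume at the letter b, and text still begins with "a" and ends with "b".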
Although the authoritative definition of UTF-8 is in the Unicode standard, with content
as described here, there is also a description of UTF-8 as an Internet standard, STD 63.
It is currently RFC 3629, “UTF-8, ...