
unit to determine a character. If it is a low surrogate, you need to read the preceding
code unit.
Since conformance to the Unicode standard does not require support for all Unicode
characters, it is quite permissible for an implementation to be ignorant of all characters
outside the BMP. It could be incapable of rendering any of them or processing them in
any useful way. However, for conformance, an implementation must be able to recognize
that there is a surrogate code unit pair in UTF-16 encoded data. It must treat the code
units in it not as two characters but as a representation of one character, although
perhaps a completely unknown character.
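The pairing rule can be illustrated with a short Python sketch. It is only an illustration, not part of any conformance requirement: the function names are hypothetical, the input is assumed to be a list of 16-bit integers, and an unpaired surrogate is simply passed through unchanged, which is one of several possible policies.

```python
def is_high_surrogate(unit):
    # High (leading) surrogates occupy D800..DBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Low (trailing) surrogates occupy DC00..DFFF.
    return 0xDC00 <= unit <= 0xDFFF


def code_points(units):
    """Return the code points encoded by a list of UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    code point outside the BMP, so the pair counts as one character.
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit)
                and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            high = unit - 0xD800
            low = units[i + 1] - 0xDC00
            result.append(0x10000 + (high << 10) + low)
            i += 2
        else:
            # A BMP character, or an unpaired surrogate (kept as-is here).
            result.append(unit)
            i += 1
    return result
```

For the four code units 0041 D835 DD04 0042, the sketch yields three code points, so even a program that knows nothing about the character in the middle can count it as one character rather than two.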
UTF-8
UTF-8 uses 8-bit code units, and it represents characters in the Basic Latin (ASCII)
range U+0000 to U+007F efficiently, one code unit per character. On the other hand,
this implies that all other characters use at least two code units, which all have the most
significant bit set; that is, they are in the range 80 to FF (hexadecimal). More exactly,
they are in the range 80 to F4. This means that when there is a code unit in the range
00 to 7F in UTF-8 data, we can know that it represents a Basic Latin character and
cannot be part of the representation of some other character.
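A practical consequence is that code can search UTF-8 data for Basic Latin characters one code unit at a time without decoding it. The following Python sketch shows this under stated assumptions: the function names are hypothetical, and the split of the upper range into continuation units (80 to BF) and leading units (C2 to F4) follows the current definition of UTF-8 rather than anything stated in the text above.

```python
def classify_utf8_unit(unit):
    """Classify a single 8-bit code unit of UTF-8 data."""
    if unit <= 0x7F:
        return "basic-latin"      # always a complete character by itself
    if unit <= 0xBF:
        return "continuation"     # second or later unit of a sequence
    if 0xC2 <= unit <= 0xF4:
        return "lead"             # first unit of a multi-unit sequence
    return "invalid"              # C0, C1, F5..FF never occur in UTF-8


def basic_latin_positions(data):
    """Return the offsets of all Basic Latin characters in UTF-8 bytes."""
    return [i for i, unit in enumerate(data)
            if classify_utf8_unit(unit) == "basic-latin"]
```

For example, in the UTF-8 encoding of "aé/" the letter é occupies two code units, both above 7F, so `basic_latin_positions("aé/".encode("utf-8"))` reports only the offsets of "a" and "/".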
These structural decisions imply that UTF-8 is relatively inefficient, since it leaves many
simple combinations unused. There is yet another principle that has a similar effect. ...