
appropriate name). When it occurs in all other contexts (that is, buried within a file,
stream, buffer, or string), it is used as a Zero-Width No-Break Space (ZWNBSP). However,
its use as a ZWNBSP is deprecated, and Word Joiner (U+2060) should be used instead. The
BOM is necessary only for the UTF-16 and UTF-32 encoding forms, but it is also useful
for the UTF-8 encoding form in that it serves as an indicator that the data is intended to
be Unicode, and not ASCII or a legacy encoding method that happens to include ASCII
as a subset.
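As a rough illustration of the BOM's role as an encoding indicator, the following Python sketch checks the first bytes of a buffer against the known BOM signatures. The function name `sniff_bom` and the table `_BOMS` are hypothetical names chosen for this example; the `codecs.BOM_*` constants are part of Python's standard library.

```python
import codecs

# Longer signatures are listed first: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM leads the data."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# U+FEFF is treated as a BOM only at the very start of the data. Anywhere
# else it decodes to an ordinary character (the deprecated ZWNBSP usage).
interior = (b"a" + codecs.BOM_UTF8 + b"b").decode("utf-8")
```

A buffer with no BOM simply yields `(None, 0)`, in which case the encoding must be determined by other means.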
e Replacement Character (
?
; ) is generically used for characters that cannot be
represented in Unicode, or for representing invalid or illegal input, such as invalid UTF-8
byte sequences, unpaired UTF-16 High or Low Surrogates, and so on.
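To see the Replacement Character in practice, here is a minimal Python sketch that decodes ill-formed input with the standard `errors="replace"` handler. The sample byte strings and the helper `is_strictly_valid` are made up for this example.

```python
# Invalid UTF-8: the byte 0xFF never appears in a well-formed UTF-8 sequence.
bad_utf8 = b"abc\xffdef"
replaced = bad_utf8.decode("utf-8", errors="replace")

# An unpaired UTF-16 High Surrogate (U+D83D) followed by an ordinary "a".
bad_utf16 = b"\x3d\xd8" + "a".encode("utf-16-le")
replaced16 = bad_utf16.decode("utf-16-le", errors="replace")


def is_strictly_valid(data: bytes, encoding: str) -> bool:
    """Return True only if the data decodes without any error handling."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False
```

With the default strict handler, both samples raise `UnicodeDecodeError` instead of being silently repaired, which is usually preferable when the input is expected to be clean.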
Some of these special characters will be referred to throughout this chapter, and perhaps
elsewhere in this book. I know from personal experience that being aware of these char-
acters and their proper usage is extremely helpful.
Unicode Scalar Values
When Unicode characters are referenced outside the context of a specific encoding form,
Unicode scalar values are used. This is a notation that serves to explicitly identify Unicode
characters and unambiguously distinguishes them from other characters, and from each
other. This notation also supports sequences of Unicode code points. ...
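The conventional way to write a code point independently of any encoding form is `U+` followed by four to six hexadecimal digits. The short Python sketch below converts between text and that notation. The function names, and the choice of a space as the separator for sequences, are assumptions made for this example rather than anything prescribed here.

```python
def to_u_notation(text: str) -> str:
    """Render each code point as U+XXXX, zero-padded to at least four hex digits."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def from_u_notation(notation: str) -> str:
    """Parse a space-separated sequence of U+XXXX items back into text."""
    chars = []
    for item in notation.split():
        if not item.upper().startswith("U+"):
            raise ValueError(f"not in U+ notation: {item!r}")
        chars.append(chr(int(item[2:], 16)))
    return "".join(chars)
```

For example, `to_u_notation("\u2060")` gives `U+2060` (Word Joiner), and a code point beyond the BMP such as U+20000 is written with five hexadecimal digits.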