Not all compatibility characters are compatibility decomposable. Many
of them have decompositions that are canonical.
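
The distinction can be inspected programmatically. The following is a minimal sketch that assumes Python's standard unicodedata module; the helper name decomposition_kind and the sample characters (ANGSTROM SIGN, OHM SIGN, and the "fi" ligature) are chosen here for illustration only. In the data that the module exposes, a compatibility mapping starts with a tag in angle brackets such as <compat>, whereas a canonical mapping has no tag.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify the decomposition mapping of a single character.

    Hypothetical helper for illustration: returns "none", "canonical",
    or "compatibility".
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a formatting tag such as "<compat>".
    return "compatibility" if mapping.startswith("<") else "canonical"


# Sample characters picked for this sketch.
for ch in ("\u212B", "\u2126", "\uFB01", "A"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_kind(ch)}")
```

ANGSTROM SIGN and OHM SIGN are commonly cited as characters encoded for compatibility with other standards, yet the sketch reports their mappings as canonical.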
W3C Normalization
The World Wide Web Consortium (W3C) favors Normalization Form C on the Web,
and it additionally suggests stronger normalization rules in HTML and XML docu-
ments. The stronger rules are external to Unicode, since they relate to markup, not
plain text. They are briefly described here due to their practical impact. The rules are
described in more detail in the document “Character Model for the World Wide Web
1.0: Normalization,” http://www.w3.org/TR/charmod-norm/. Note, however, that the
document is officially only a Working Draft (a work in progress).
The W3C normalization rules require that text be in NFC and additionally forbid the
occurrence of character references and entity references that would make the text non-
normalized, if replaced by the characters that they denote. For example, by Unicode
rules, NFC does not allow the appearance of “e” followed by a combining acute accent,
since this combination must be replaced by the precomposed character é. The W3C
normalization rules also forbid the indirect appearance of the combination, for
example, as in e&#x301; (where &#x301; is a character reference that denotes the combining
acute accent U+0301).
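
The core of this rule can be approximated in a few lines. The sketch below is only a rough check under stated assumptions, not an implementation of the full W3C rules: it uses Python's standard html.unescape to expand character and entity references and unicodedata.normalize to test for NFC. The function names is_nfc and expands_to_nfc are hypothetical, and the sketch ignores markup structure entirely.

```python
import html
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def expands_to_nfc(markup_text: str) -> bool:
    """Return True if the text is still in NFC after its character
    references and entity references are replaced by the characters
    they denote."""
    expanded = html.unescape(markup_text)
    return is_nfc(expanded)


sample = "e&#x301;"
print(is_nfc(sample))          # the raw string is plain ASCII
print(expands_to_nfc(sample))  # the expanded text is "e" + U+0301
```

The raw string passes a plain NFC test because it consists of ASCII characters only; the problem becomes visible only once the reference is expanded, which is why the check has to be made on the expanded text.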
On the Web, expressions like e&#x301; are rarely used in practice, since the
corresponding precomposed character (either written as such or as a character reference like
&#xE9; or &#233; or as an entity reference like &eacute;) works much better. However,
suppose that you have a database that contains characters in decomposed form. Unless
you are careful, software that presents data extracted from it in HTML or XML format
might treat data like U+0065 U+0301 so that U+0065 is represented directly as “e”
(which should cause no problems), whereas U+0301 is converted to &#x301; for safety.
This would result in data that is not W3C normalized, and this involves unnecessary
risks. A simple way to avoid this is to normalize (to NFC) the character data extracted
from the database before making any decisions on using character references to repre-
sent some characters.
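
A minimal sketch of this order of operations follows, assuming Python and its standard unicodedata module. The function name to_markup and the policy of writing every non-ASCII character as a hexadecimal character reference are assumptions made for illustration; real software may choose differently which characters to escape.

```python
import unicodedata


def to_markup(text: str) -> str:
    """Normalize to NFC first, then decide character by character
    whether to emit a character reference."""
    normalized = unicodedata.normalize("NFC", text)
    markup_escapes = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}
    out = []
    for ch in normalized:
        if ch in markup_escapes:
            # Characters that are significant in markup are always escaped.
            out.append(markup_escapes[ch])
        elif ord(ch) < 128:
            # Other ASCII characters are written directly.
            out.append(ch)
        else:
            # Assumed policy: any non-ASCII character becomes a reference.
            out.append(f"&#x{ord(ch):X};")
    return "".join(out)


# Decomposed data as it might come from a database: "e" + U+0301.
print(to_markup("cafe\u0301 & co"))
```

Because normalization runs before escaping, the pair U+0065 U+0301 has already been replaced by U+00E9 when the escaping decision is made, so the output contains &#xE9; rather than "e" followed by &#x301;.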
Case Properties
Some writing systems, such as Latin, Greek, and Cyrillic, make a distinction between
cases of letters. Historically, uppercase letters, also known as capital letters or as
majuscules, reflect the original shapes of letters. In the Middle Ages, lowercase letters,
also known as small letters or as minuscules, were invented to make writing by hand
faster. Uppercase letters were preserved for special use—e.g., for emphasis, for abbre-
viations, and for use as initials in proper names and in the first word of a sentence.