
—i.e., notations of the form &#n; or &#xn;. The document character set is the character
code (a mapping of integers to characters) according to which the n in such notations is
to be interpreted.
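As a minimal sketch of this interpretation, the following Python function expands numeric
character references by treating n as a code number in ISO 10646 (which Python's chr()
follows). The function name decode_ncrs and the choice to leave out-of-range or surrogate
values untouched are assumptions made for illustration. The sketch does not model the
error recovery that browsers apply, and it accepts only a lowercase x, as in the notation above.

```python
import re

# Decimal (&#n;) and hexadecimal (&#xn;) numeric character references.
NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_ncrs(text: str) -> str:
    """Replace each numeric character reference with the character it denotes."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        n = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Assumption: values beyond U+10FFFF and surrogate code points
        # are left as written instead of being converted.
        if n > 0x10FFFF or 0xD800 <= n <= 0xDFFF:
            return match.group(0)
        return chr(n)

    return NCR.sub(replace, text)
```

For example, decode_ncrs("caf&#233;") and decode_ncrs("caf&#xE9;") both yield "café",
because 233 decimal and E9 hexadecimal are the same number in the document character set.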
In particular, the HTML and XML specifications do not impose Unicode semantics on
characters, for two reasons: they formally refer to ISO 10646, not the Unicode standard,
and even if they referred to Unicode, the reference alone would not make conformance to
that standard a requirement. Of course, software that processes HTML or XML documents
may apply Unicode semantics and rules, such as line breaking rules, but this is
not a requirement. Only for some features related to directionality do the HTML
specifications refer to Unicode rules normatively.
The HTML specifications contain some special restrictions on the use of control
characters, as listed in Table 11-2. There is usually little reason for control characters
other than line breaks, and sometimes horizontal tabs, to appear in HTML documents.
They may, however, appear as a result of conversions. The rules for them in HTML up to
and including HTML 4.01 differ somewhat from the rules in XHTML. (Technically, the
SGML declaration for HTML 4.01 disallows U+000C, but the prose discusses it as an
allowed character. In any case, it would be treated as whitespace, not as a page eject character.)
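Since such characters typically slip in through conversions, it can help to scan a
document for them before publishing. The following sketch reports C0 and C1 control
characters (and DEL, U+007F) found in a string. The function name and the default set of
characters to ignore (tab, line feed, carriage return) are assumptions for illustration,
and the sketch does not reproduce the rules of Table 11-2.

```python
def find_control_characters(text, ignore=("\t", "\n", "\r")):
    """Return (index, 'U+XXXX') pairs for control characters not in `ignore`."""
    hits = []
    for index, ch in enumerate(text):
        code = ord(ch)
        is_c0 = code <= 0x1F                   # C0 controls: U+0000..U+001F
        is_del_or_c1 = 0x7F <= code <= 0x9F    # DEL and C1 controls: U+007F..U+009F
        if (is_c0 or is_del_or_c1) and ch not in ignore:
            hits.append((index, f"U+{code:04X}"))
    return hits
```

For instance, find_control_characters("a\x0cb\tc\x85") reports the form feed at index 1 and
U+0085 at index 5, while the tab is skipped because it is in the ignore set.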
Table 11-2. C0 and C1 control characters in HTML
Character(s) Explanation Use in HTML
U+0000..U+0008 ...