
CHAPTER 9
The Character Level and Above
In the representation of text, characters form only one protocol level. Above it are
higher levels, such as the markup level, the record structure level, and the application
level. This chapter gives guidelines on which level to use for coding information when
there is a choice, such as expressing a distinction in markup versus as a difference
between characters (still largely an open problem, despite the efforts of the World Wide
Web Consortium and the Unicode Consortium). The choice is particularly important in
processing legacy data and in avoiding overly fine distinctions at the character level.
The chapter ends with a section on media types for text and on the differences between
plain text, other subtypes of text, and application types such as text processing formats.
Levels of Text Representation and Processing
The Unicode Standard defines the term higher-level protocol as denoting “any agreement
on the interpretation of Unicode characters that extends beyond the scope of this
standard.” It adds a note: “Such an agreement need not be formally announced in data;
it may be implicit in the context.”
For example, the XML specification is an agreement which says that a sequence of
characters like &#x3C0; will be understood as a character reference (denoting the Greek
small letter pi π, U+03C0, in this case). This is an example of a very explicit agreement. The
scope of this agreement consists of XML documents, ...
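
The difference between the two levels can be made concrete with a minimal sketch in Python. It assumes the hexadecimal reference form &#x3C0; and uses the standard library's xml.etree.ElementTree parser; the element name p and the variable names are illustrative only. At the character level, the reference is a run of ordinary ASCII characters. Only a processor that applies the XML agreement turns it into a single character.

```python
import xml.etree.ElementTree as ET

# Character level: the reference is just a sequence of ASCII characters.
reference = "&#x3C0;"
is_plain_ascii = reference.isascii()

# Markup level: an XML parser applies the higher-level protocol and
# replaces the reference with the character it denotes.
# The element name "p" is an arbitrary choice for this illustration.
element = ET.fromstring("<p>" + reference + "</p>")
decoded = element.text

print(is_plain_ascii)            # the reference contains no non-ASCII characters
print(len(decoded))              # number of characters after XML processing
print(f"U+{ord(decoded):04X}")   # code point of the resulting character
```

The same string read as plain text, outside an XML document, would remain a sequence of ASCII characters, because the agreement that gives it a special meaning does not apply there.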