Character References
Unicode contains more than 96,000 different characters covering almost all of the world’s written languages. Predefining entity references for each of these characters, most of which will never be used in any one document, would impose an excessive burden on XML parsers. Rather than pick and choose which characters are worthy of being encoded as entities, XML goes to the other extreme. It predefines entity references only for characters that have special meaning as markup in an XML document: <, >, &, ", and '. All these are ASCII characters that are easy to type in any text editor.
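As a rough illustration of the five predefined entity references, the following Python sketch parses each one with the standard library's xml.etree.ElementTree module and checks which character comes back. The PREDEFINED table and the resolve helper are names invented for this example, not part of any XML API.

```python
import xml.etree.ElementTree as ET

# The five entity references XML predefines, mapped to the markup
# characters they stand for. (Illustrative table, not a library constant.)
PREDEFINED = {
    "&lt;": "<",
    "&gt;": ">",
    "&amp;": "&",
    "&quot;": '"',
    "&apos;": "'",
}


def resolve(reference: str) -> str:
    """Parse a one-element document whose only content is the reference."""
    return ET.fromstring(f"<t>{reference}</t>").text


# Each predefined reference resolves to its markup character.
for reference, expected in PREDEFINED.items():
    assert resolve(reference) == expected
```

Any other named entity, such as &eacute;, would have to be declared in a DTD before a parser accepts it; without a declaration, this parser reports an undefined entity.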
For other characters that may not be accessible from an ASCII text editor, XML lets you use character references. A character reference gives the number of the particular Unicode character it stands for, in either decimal or hexadecimal. Decimal character references look like &#1114;; hexadecimal character references have an extra x after the &#; that is, they look like &#x45A;. Both of these references refer to the same character, њ, the Cyrillic small letter “nje” used in Serbian and Macedonian. For example, suppose you want to include the Greek maxim “σοφός έαυτόν γιγνώσκει” (“The wise man knows himself”) in your XML document. However, you only have an ASCII text editor at your disposal. You can replace each Greek letter with the correct character reference, like this:
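<maxim>&#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2; &#x3AD;&#x3B1;&#x3C5;&#x3C4;&#x3CC;&#x3BD; &#x3B3;&#x3B9;&#x3B3;&#x3BD;&#x3CE;&#x3C3;&#x3BA;&#x3B5;&#x3B9; ...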
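Replacing each letter by hand is tedious, so the substitution can be scripted. The sketch below is one possible approach in Python, assuming the text is already available as a Unicode string. The to_char_refs function is a hypothetical helper written for this example. It leaves ASCII characters untouched, so markup characters such as < and & would still need the predefined entity references.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text: str, hexadecimal: bool = True) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            pieces.append(ch)                    # ASCII stays as typed
        elif hexadecimal:
            pieces.append(f"&#x{code_point:X};")  # hexadecimal form
        else:
            pieces.append(f"&#{code_point};")     # decimal form
    return "".join(pieces)


# Decimal and hexadecimal references to the same character, U+045A.
assert to_char_refs("\u045a", hexadecimal=False) == "&#1114;"
assert to_char_refs("\u045a") == "&#x45A;"

# Encode the maxim so the document contains only ASCII characters.
maxim = "σοφός έαυτόν γιγνώσκει"
encoded = f"<maxim>{to_char_refs(maxim)}</maxim>"
print(encoded)

# A parser resolves the references back to the original Greek text.
assert ET.fromstring(encoded).text == maxim
```

For the decimal form alone, Python's built-in xmlcharrefreplace error handler gives the same result: text.encode("ascii", "xmlcharrefreplace").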