Unicode (ISO/IEC 10646-1)
SGML-based markup languages are required to define a document character set that serves as the basis for interpreting characters. The document character set for HTML (4 and 4.01), XHTML, and XML is the Universal Character Set (UCS), which is a superset of all widely used standard character sets in the world.
The UCS is defined by both the Unicode and ISO/IEC 10646 standards. The code points in Unicode and ISO/IEC 10646 are identical, and the two standards are developed in parallel. The difference is that Unicode adds some rules about how characters should be used. The Unicode Standard is also used as a reference for such issues as the bidirectional text algorithm for handling reading direction within text. It is defined by the Unicode Consortium (http://www.unicode.org).
Tip
In common practice, and throughout this book, the Universal Character Set is referred to simply as “Unicode.”
Because Unicode is the document character set for all (X)HTML documents, numeric character references in web documents will always be interpreted according to Unicode code points, regardless of the document’s declared encoding.
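The following Python sketch illustrates the point. It serializes the same markup in several encodings, decodes each byte stream with its declared encoding, and then expands the numeric character references. The helper name resolve_references and the sample string are hypothetical, chosen only for this illustration. html.unescape is part of Python's standard library and follows HTML5 rules, so it stands in for a browser here rather than reproducing one exactly.

```python
import html


def resolve_references(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes with the declared encoding, then expand character references.

    Hypothetical helper for illustration; a real parser does far more.
    """
    return html.unescape(raw.decode(declared_encoding))


# The markup itself uses only ASCII characters, so it can be
# serialized in any of the encodings below.
markup = "caf&#233; costs &#x20AC;5"

results = {
    enc: resolve_references(markup.encode(enc), enc)
    for enc in ("ascii", "iso-8859-1", "utf-8", "utf-16")
}

for enc, text in results.items():
    # Show the resolved text and the code points above the ASCII range.
    high = [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 127]
    print(f"{enc:>10}: {text}  {high}")
```

In every case, &#233; resolves to U+00E9 (é) and &#x20AC; resolves to U+20AC (€). That holds even for ISO-8859-1, which has no euro sign of its own, because the reference names a Unicode code point rather than a byte value in the document's encoding.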