Encode XML Documents
Character encoding matters whenever XML documents cross international boundaries. This hack will help you understand and use character encoding in XML.
To understand XML, you need to understand the characters that can make up XML documents. XML 1.0 supports the UCS standard, officially ISO/IEC 10646-1:1993 Information technology—Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and Basic Multilingual Plane, and its seven amendments (search for 10646 on http://www.iso.ch). Since the time that XML became a recommendation at the W3C, UCS has advanced to ISO/IEC 10646-1:2000. In addition, Unicode is a parallel standard to UCS (see http://www.unicode.org). XML 1.0 supports Unicode Version 2.0, but Unicode has advanced to Version 4.0 at this time, so there are differences in what XML 1.0 supports and in what the latest versions of UCS and Unicode support.
Both ISO/IEC 10646-1 UCS and Unicode assign the same values and descriptions for each character; however, Unicode defines some semantics for the characters that ISO/IEC 10646-1 does not.
Tip
Mike Brown’s XML tutorial at http://www.skew.org/xml/tutorial is good background reading on Unicode and character sets. To look up general character charts, see Kosta Kostis’s charts at http://www.kostis.net/charsets/. For Unicode character charts, go to http://www.unicode.org/charts/.
Each character in Unicode is assigned a unique number, which is conventionally written in hexadecimal (base 16). The first 128 characters ...
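To make the idea of a character's number concrete, here is a minimal sketch in Python that prints the hexadecimal value of a few characters, along with the bytes each one becomes in UTF-8 and UTF-16. The function name `describe` and the sample characters are assumptions chosen for illustration; they do not come from the original text.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Report the Unicode number of one character and how it is encoded.

    `describe` is a hypothetical helper written for this illustration.
    """
    cp = ord(ch)  # the character's Unicode number as an integer
    return {
        "code_point": f"U+{cp:04X}",                 # conventional hex notation
        "name": unicodedata.name(ch, "<unnamed>"),   # official character name
        "utf8": ch.encode("utf-8").hex(),            # bytes in UTF-8, as hex
        "utf16be": ch.encode("utf-16-be").hex(),     # bytes in big-endian UTF-16
        "xml_char_ref": f"&#x{cp:X};",               # XML numeric character reference
    }


if __name__ == "__main__":
    # Sample characters: an ASCII letter, an accented letter, the euro sign.
    for ch in ("A", "\u00e9", "\u20ac"):
        print(describe(ch))
```

The character's number stays the same whatever encoding is used; only the byte sequence changes. That distinction is why an XML document has to say which encoding it uses.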