The encoding Attribute

So far, we’ve been a little cavalier about character sets and character encodings. We’ve said that XML documents are composed of pure text, but we haven’t said what encoding that text uses. Is it ASCII? Latin-1? Unicode? Something else?

The short answer to this question is “Yes.” The long answer is that, by default, XML documents are assumed to be encoded in UTF-8, a variable-length encoding of the Unicode character set. UTF-8 is a strict superset of ASCII, so pure ASCII text files are also UTF-8 documents. However, most XML processors, especially those written in Java, can handle a much broader range of character sets. All you have to do is tell the parser which character encoding the document uses. Preferably, this is done through metadata stored in the filesystem or provided by the server. However, not all systems provide character-set metadata, so XML also allows documents to specify their own character set with an encoding declaration inside the XML declaration. Example 2-8 shows how you’d indicate that a document was written in the ISO-8859-1 (Latin-1) character set, which includes letters like ö and ç needed for many non-English Western European languages.

Example 2-8. An XML document encoded in Latin-1
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<person>
  Erwin Schrödinger
</person>
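
To see the encoding declaration at work, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. That choice of parser is an assumption made for illustration, as are the variable names; other parsers may report errors differently. The sketch serializes Example 2-8 as Latin-1 bytes, parses it, and then tries the same bytes again with the XML declaration stripped off and no external metadata supplied.

```python
import xml.etree.ElementTree as ET

# The document from Example 2-8, held as a Python string.
DOCUMENT = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n'
    "<person>\n"
    "  Erwin Schrödinger\n"
    "</person>\n"
)

# Serialize it the way the declaration promises. In Latin-1 the
# letter ö is the single byte 0xF6.
latin1_bytes = DOCUMENT.encode("iso-8859-1")

# Given raw bytes, the parser reads the encoding declaration and
# decodes the rest of the document accordingly.
root = ET.fromstring(latin1_bytes)
name = root.text.strip()

# The same bytes without the XML declaration. With no declaration and
# no metadata, the parser falls back to UTF-8, and a lone 0xF6 byte
# is not valid UTF-8, so parsing is expected to fail.
undeclared_bytes = latin1_bytes.split(b"?>\n", 1)[1]
try:
    ET.fromstring(undeclared_bytes)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False
```

After this runs, name holds the correctly decoded text Erwin Schrödinger, while parsed_without_declaration is False. The bytes on disk did not change between the two attempts; only the information available to the parser did.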

The encoding attribute is optional in an XML declaration. If it is omitted and no metadata is available, the Unicode character set is assumed. The ...
