Documents are the heart of XML. Any usable piece of XML is presented as a document, often stored in a file. One of the very first things you must understand in order to use XML is how to create a well-formed document. In this section, we examine the syntactic components of a document, starting with individual characters and how they combine into larger syntactic constructs. Then we look at the constructs that the XML specification defines for all documents.
The XML specification defines a character as “an atomic unit of text as specified by ISO/IEC 10646.” (Remember, ISO/IEC 10646 is the international standard whose character set is kept in step with Unicode, which is the name most people use for it.) Of course, this explanation is exactly what you should say at a party if someone asks. One of the goals of both standardization and XML is to make documents easily understandable by platforms around the globe. As a result, even something as simple as an ASCII character can become quite complex.
Regardless, the specification states that legal characters are “tab, carriage return, line feed,” plus the legal characters of Unicode and ISO/IEC 10646. In practice, this means that most other control characters are not allowed in a document. If you were to write an XML parser, the topic of characters and standardization would be of incredible importance to you. For the rest of us, it’s usually enough to choose an XML parser that gets it right.
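For the curious, the character rule can be sketched in a few lines of Python. This is only an illustration based on the character ranges given in the XML 1.0 specification, not a replacement for a real parser; the function names `is_xml_char` and `first_illegal_char` are made up for this example.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch is a legal XML 1.0 character.

    Ranges follow the Char production of XML 1.0: tab, line feed,
    carriage return, and most of Unicode except surrogates,
    U+FFFE, and U+FFFF.
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(text: str):
    """Return the index of the first illegal character, or None if all are legal."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None
```

A check like this can be handy for screening text before you place it in a document, because a conforming parser is required to reject a document that contains an illegal character.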
You can state the character encoding of an XML document in the optional XML declaration, which, when present, must appear at the very start of the document:
<?xml version="1.0" encoding="UTF-8"?>
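To see the declaration at work, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `greeting` element and its content are invented for the example; the point is that the parser reads the `encoding` value from the declaration and decodes the bytes accordingly.

```python
import xml.etree.ElementTree as ET

# A hypothetical document stored as ISO-8859-1 bytes rather than UTF-8.
# The declaration tells the parser how to interpret the bytes that follow.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<greeting>caf\u00e9</greeting>"
).encode("iso-8859-1")

# Passing bytes (not str) lets the parser honor the declared encoding.
root = ET.fromstring(latin1_doc)
greeting_text = root.text  # decoded to a normal Python string
```

If the declaration named the wrong encoding, the parser would either report an error or silently produce the wrong characters, which is why the declared encoding should always match how the file was actually saved.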