XML in a Nutshell
XML in a Nutshell, Third Edition By Elliotte Rusty Harold, W. Scott Means
September 2004
Pages: 712



Table of Contents

Chapter 1: Introducing XML
XML, the Extensible Markup Language, is a W3C-endorsed standard for document markup. It defines a generic syntax used to mark up data with simple, human-readable tags. It provides a standard format for computer documents that is flexible enough to be customized for domains as diverse as web sites, electronic data interchange, vector graphics, genealogy, real estate listings, object serialization, remote procedure calls, voice mail systems, and more.
You can write your own programs that interact with, massage, and manipulate the data in XML documents. If you do, you'll have access to a wide range of free libraries in a variety of languages that can read and write XML so that you can focus on the unique needs of your program. Or you can use off-the-shelf software, such as web browsers and text editors, to work with XML documents. Some tools are able to work with any XML document. Others are customized to support a particular XML application in a particular domain, such as vector graphics, and may not be of much use outside that domain. But the same underlying syntax is used in all cases, even if it's deliberately hidden by the more user-friendly tools or restricted to a single application.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Benefits of XML
XML is a metamarkup language for text documents. Data are included in XML documents as strings of text. The data are surrounded by text markup that describes the data. XML's basic unit of data and markup is called an element. The XML specification defines the exact syntax this markup must follow: how elements are delimited by tags, what a tag looks like, what names are acceptable for elements, where attributes are placed, and so forth. Superficially, the markup in an XML document looks a lot like the markup in an HTML document, but there are some crucial differences.
Most importantly, XML is a metamarkup language. That means it doesn't have a fixed set of tags and elements that are supposed to work for everybody in all areas of interest for all time. Any attempt to create a finite set of such tags is doomed to failure. Instead, XML allows developers and writers to invent the elements they need as they need them. Chemists can use elements that describe molecules, atoms, bonds, reactions, and other items encountered in chemistry. Real estate agents can use elements that describe apartments, rents, commissions, locations, and other items needed for real estate. Musicians can use elements that describe quarter notes, half notes, G-clefs, lyrics, and other objects common in music. The X in XML stands for Extensible. Extensible means that the language can be extended and adapted to meet many different needs.
Although XML is quite flexible in the elements it allows, it is quite strict in many other respects. The XML specification defines a grammar for XML documents that says where tags may be placed, what they must look like, which element names are legal, how attributes are attached to elements, and so forth. This grammar is specific enough to allow the development of XML parsers that can read any XML document. Documents that satisfy this grammar are said to be
What XML Is Not
XML is a markup language, and it is only a markup language. It's important to remember that. The XML hype has gotten so extreme that some people expect XML to do everything up to and including washing the family dog.
First of all, XML is not a programming language. There's no such thing as an XML compiler that reads XML files and produces executable code. You might perhaps define a scripting language that used a native XML format and was interpreted by a binary program, but even this application would be unusual. XML can be used as a format for instructions to programs that do make things happen, just as a traditional program may read a text config file and take different actions depending on what it sees there. Indeed, there's no reason a config file can't be XML instead of unstructured text. Some more recent programs use XML config files; but in all cases, it's the program taking action, not the XML document itself. An XML document by itself simply is. It does not do anything.
At least one XML application, XSL Transformations (XSLT), has been proven to be Turing complete by construction. See http://www.unidex.com/turing/utm.htm for one universal Turing machine written in XSLT.
Second, XML is not a network transport protocol. XML won't send data across the network, any more than HTML will. Data sent across the network using HTTP, FTP, NFS, or some other protocol might be encoded in XML; but again there has to be some software outside the XML document that actually sends the document.
Finally, to mention the example where the hype most often obscures the reality, XML is not a database. You're not going to replace an Oracle or MySQL server with XML. A database can contain XML data, either as a VARCHAR or a BLOB or as some custom XML data type, but the database itself is not an XML document. You can store XML data in a database on a server or retrieve data from a database in an XML format, but to do this, you need to be running software written in a real programming language such as C or Java. To store XML in a database, software on the client side will send the XML data to the server using an established network protocol such as TCP/IP. Software on the server side will receive the XML data, parse it, and store it in the database. To retrieve an XML document from a database, you'll generally pass through some middleware product like Enhydra that makes SQL queries against the database and formats the result set as XML before returning it to the client. Indeed, some databases may integrate this software code into their core server or provide plug-ins to do it, such as the Oracle XSQL servlet. XML serves very well as a ubiquitous, platform-independent transport format in these scenarios. However, it is not the database, and it shouldn't be used as one.
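The division of labor described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than Oracle or MySQL, and the table name documents and column name body are hypothetical. The point is that a program stores and retrieves the XML; the XML itself is just text in a column.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical table and column names, chosen only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")

xml_text = "<person>Alan Turing</person>"

# The program, not the XML document, hands the data to the database.
conn.execute("INSERT INTO documents (body) VALUES (?)", (xml_text,))

# On retrieval the database returns plain text; a parser makes sense of it.
(stored,) = conn.execute("SELECT body FROM documents WHERE id = 1").fetchone()
person = ET.fromstring(stored)
conn.close()
```

Here the database never interprets the markup. It simply holds a string that software on either side chooses to treat as XML.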
Portable Data
XML offers the tantalizing possibility of truly cross-platform, long-term data formats. It's long been the case that a document written on one platform is not necessarily readable on a different platform, or by a different program on the same platform, or even by a future or past version of the same program on the same platform. When the document can be read, there's no guarantee that all the information will come across. Much of the data from the original moon landings in the late 1960s and early 1970s is now effectively lost. Even if you can find a tape drive that can read the now obsolete tapes, nobody knows what format the data is stored in on the tapes!
XML is an incredibly simple, well-documented, straightforward data format. XML documents are text and can be read with any tool that can read a text file. Not just the data, but also the markup is text, and it's present right there in the XML file as tags. You don't have to wonder whether every eighth byte is random padding, guess whether a four-byte quantity is a two's complement integer or an IEEE 754 floating point number, or try to decipher which integer codes map to which formatting properties. You can read the tag names directly to find out exactly what's in the document. Similarly, since element boundaries are defined by tags, you aren't likely to be tripped up by unexpected line-ending conventions or the number of spaces that are mapped to a tab. All the important details about the structure of the document are explicit. You don't have to reverse-engineer the format or rely on incomplete and often unavailable documentation.
A few software vendors may want to lock in their users with undocumented, proprietary, binary file formats. However, in the long term, we're all better off if we can use the cleanly documented, well-understood, easy to parse, text-based formats that XML provides. XML lets documents and data be moved from one system to another with a reasonable hope that the receiving system will be able to make sense out of it. Furthermore, validation lets the receiving side check that it gets what it expects. Java promised portable code; XML delivers portable data. In many ways, XML is the most portable and flexible document format designed since the ASCII text file.
How XML Works
Example 1-1 shows a simple XML document. This particular XML document might be seen in an inventory-control system or a stock database. It marks up the data with tags and attributes describing the color, size, bar-code number, manufacturer, name of the product, and so on.
Example 1-1. An XML document
<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
  <size>3.5"</size>
  <color>black</color>
  <description>floppy disks</description>
</product>
This document is text and can be stored in a text file. You can edit this file with any standard text editor such as BBEdit, jEdit, UltraEdit, Emacs, or vi. You do not need a special XML editor. Indeed, we find most general-purpose XML editors to be far more trouble than they're worth and much harder to use than simply editing documents in a text editor.
Programs that actually try to understand the contents of the XML document—that is, do more than merely treat it as any other text file—will use an XML parser to read the document. The parser is responsible for dividing the document into individual elements, attributes, and other pieces. It passes the contents of the XML document to an application piece by piece. If at any point the parser detects a violation of the well-formedness rules of XML, then it reports the error to the application and stops parsing. In some cases, the parser may read further in the document, past the original error, so that it can detect and report other errors that occur later in the document. However, once it has detected the first well-formedness error, it will no longer pass along the contents of the elements and attributes it encounters.
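As a rough illustration of the parser's role, the following Python sketch reads Example 1-1 with the standard library's xml.etree.ElementTree module and then feeds the same parser a copy with the final end-tag removed. The choice of library and the variable names are assumptions made for this example; any conforming parser should behave similarly.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
  <size>3.5"</size>
  <color>black</color>
  <description>floppy disks</description>
</product>"""

# The parser divides the document into elements and attributes.
product = ET.fromstring(document)
barcode = product.get("barcode")
fields = {child.tag: child.text for child in product}

# Dropping the final end-tag violates the well-formedness rules,
# so the parser reports an error instead of returning content.
broken = document.replace("</product>", "")
try:
    ET.fromstring(broken)
    well_formed = True
except ET.ParseError:
    well_formed = False
```

The application never sees the angle brackets. It receives element names, attribute values, and text, or else an error.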
Individual XML applications normally dictate more precise rules about exactly which elements and attributes are allowed where. For instance, you wouldn't expect to find a G_Clef element when reading a biology document. Some of these rules can be precisely specified with a schema written in any of several languages, including the W3C XML Schema Language, RELAX NG, and DTDs. A document may contain a URL indicating where the schema can be found. Some XML parsers will notice this and compare the document to its schema as they read it to see if the document satisfies the constraints specified there. Such a parser is called a
The Evolution of XML
XML is a descendant of SGML, the Standard Generalized Markup Language. The language that would eventually become SGML was invented by Charles F. Goldfarb, Ed Mosher, and Ray Lorie at IBM in the 1970s and developed by several hundred people around the world until its eventual adoption as ISO standard 8879 in 1986. SGML was intended to solve many of the same problems XML solves in much the same way XML solves them. It is a semantic and structural markup language for text documents. SGML is extremely powerful and achieved some success in the U.S. military and government, in the aerospace sector, and in other domains that needed ways of efficiently managing technical documents that were tens of thousands of pages long.
SGML's biggest success was HTML, which is an SGML application. However, HTML is just one SGML application. It does not have or offer anywhere near the full power of SGML itself. Since it restricts authors to a finite set of tags designed to describe web pages (and describes them in a fairly presentation-oriented way at that), it's really little more than a traditional markup language that has been adopted by web browsers. It doesn't lend itself to use beyond the single application of web page design. You would not use HTML to exchange data between incompatible databases or to send updated product catalogs to retailer sites, for example. HTML does web pages, and it does them very well, but it only does web pages.
SGML was the obvious choice for other applications that took advantage of the Internet but were not simple web pages for humans to read. The problem was that SGML is complicated—very, very complicated. The official SGML specification is over 150 very technical pages. It covers many special cases and unlikely scenarios. It is so complex that almost no software has ever implemented it fully. Programs that implemented or relied on different subsets of SGML were often incompatible with each other. The special feature one program considered essential would be considered extraneous fluff and omitted by the next program.
Chapter 2: XML Fundamentals
This chapter shows you how to write simple XML documents. You'll see that an XML document is built from text content marked up with text tags such as <SKU>, <Record_ID>, and <author> that look superficially like HTML tags. However, in HTML you're limited to about a hundred predefined tags that describe web page formatting. In XML, you can create as many tags as you need. Furthermore, these tags will mostly describe the type of content they contain rather than formatting or layout information. In XML you don't say that something is italicized or indented or bold, you say that it's a book or a biography or a calendar.
Although XML is looser than HTML in regard to which tags it allows, it is much stricter about where those tags are placed and how they're written. In particular, all XML documents must be well-formed. Well-formedness rules specify constraints such as "Every start-tag must have a matching end-tag," and "Attribute values must be quoted." These rules are unbreakable, which makes parsing XML documents easier and writing them a little harder, but they still allow an almost unlimited flexibility of expression.
XML Documents and XML Files
An XML document contains text, never binary data. It can be opened with any program that knows how to read a text file. Example 2-1 is close to the simplest XML document imaginable. Nonetheless, it is a well-formed XML document. XML parsers can read it and understand it (at least as far as a computer program can be said to understand anything).
Example 2-1. A very simple yet complete XML document
<person>
  Alan Turing
</person>
In the most common scenario, this document would be the entire contents of a file named person.xml, or perhaps 2-1.xml. However, XML is not picky about the filename. As far as the parser is concerned, this file could be called person.txt, person, or Hey you, there's some XML in this here file! Your operating system may or may not like these names, but an XML parser won't care. The document might not even be in a file at all. It could be a record or a field in a database. It could be generated on the fly by a CGI program in response to a browser query. It could even be stored in more than one file, although that's unlikely for such a simple document. If it is served by a web server, it will probably be assigned the MIME media type application/xml or text/xml. However, specific XML applications may use more specific MIME media types, such as application/mathml+xml, application/xslt+xml, image/svg+xml, text/vnd.wap.wml, or even text/html (in very special cases).
For generic XML documents, application/xml should be preferred to text/xml, although many web servers come configured out of the box to use text/xml. text/xml uses the ASCII character set as a default, which is incorrect for most XML documents.
Elements, Tags, and Character Data
The document in Example 2-1 is composed of a single element named person. The element is delimited by the start-tag <person> and the end-tag </person>. Everything between the start-tag and the end-tag of the element (exclusive) is called the element's content. The content of this element is the text:
  Alan Turing
The whitespace is part of the content, although many applications will choose to ignore it. <person> and </person> are markup. The string "Alan Turing" and its surrounding whitespace are character data. The tag is the most common form of markup in an XML document, but there are other kinds we'll discuss later.
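The distinction between content and markup can be checked with a short Python sketch. It assumes the standard library's xml.etree.ElementTree parser; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring("<person>\n  Alan Turing\n</person>")

# The parser reports the character data with its whitespace intact.
content = person.text

# An application that does not care about the whitespace can strip it.
trimmed = content.strip()
```

The parser hands over the line breaks and indentation as part of the content. Discarding them is a decision the application makes, not the parser.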
Superficially, XML tags look like HTML tags. Start-tags begin with < and end-tags begin with </. Both of these are followed by the name of the element and are closed by >. However, unlike HTML tags, you are allowed to make up new XML tags as you go along. To describe a person, use <person> and </person> tags. To describe a calendar, use <calendar> and </calendar> tags. The names of the tags generally reflect the type of content inside the element, not how that content will be formatted.

Section 2.2.1.1: Empty elements

There's also a special syntax for empty elements, elements that have no content. Such an element can be represented by a single empty-element tag that begins with < but ends with />. For instance, in XHTML, an XMLized reformulation of standard HTML, the line-break and horizontal-rule elements are written as <br /> and <hr /> instead of <br> and <hr>. These are exactly equivalent to <br></br> and <hr></hr>, however. Which form you use for empty elements is completely up to you. However, what you cannot do in XML and XHTML (unlike HTML) is use only the start-tag—for instance <br> or <hr>—without using the matching end-tag. That would be a well-formedness error.
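The equivalence of the two empty-element forms, and the error caused by a lone start-tag, can be sketched as follows. The example assumes Python's xml.etree.ElementTree parser, and describe is a hypothetical helper written for this illustration.

```python
import xml.etree.ElementTree as ET

def describe(xml_text):
    """Return (tag, text, child count) for the document's root element."""
    root = ET.fromstring(xml_text)
    return root.tag, root.text, len(root)

# The empty-element tag and the start-tag/end-tag pair parse identically.
short_form = describe("<br />")
long_form = describe("<br></br>")

# A start-tag with no matching end-tag is a well-formedness error.
try:
    ET.fromstring("<br>")
    start_tag_only_ok = True
except ET.ParseError:
    start_tag_only_ok = False
```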
Attributes
XML elements can have attributes. An attribute is a name-value pair attached to the element's start-tag. Names are separated from values by an equals sign and optional whitespace. Values are enclosed in single or double quotation marks. For example, this person element has a born attribute with the value 1912-06-23 and a died attribute with the value 1954-06-07:
<person born="1912-06-23" died="1954-06-07">
  Alan Turing
</person>
This next element is exactly the same, as far as an XML parser is concerned. It simply uses single quotes instead of double quotes, puts some extra whitespace around the equals signs, and reorders the attributes.
<person died = '1954-06-07'  born = '1912-06-23' >
  Alan Turing
</person>
The whitespace around the equals signs is purely a matter of personal aesthetics. The single quotes may be useful in cases where the attribute value itself contains a double quote. Attribute order is not significant.
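One way to confirm that a parser treats the two forms alike is to compare the attributes it reports. This sketch assumes Python's xml.etree.ElementTree; the size element at the end is a made-up example of a value that contains a double quote.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring(
    '<person born="1912-06-23" died="1954-06-07">Alan Turing</person>')
second = ET.fromstring(
    "<person died = '1954-06-07'  born = '1912-06-23' >Alan Turing</person>")

# Quote style, whitespace around =, and attribute order make no difference.
same_attributes = first.attrib == second.attrib

# Single quotes let the value itself contain a double quote.
size = ET.fromstring("<size value='3.5\"'/>").get("value")
```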
Example 2-4 shows how attributes might be used to encode much of the same information given in the record-like document of Example 2-2.
Example 2-4. An XML document that describes a person using attributes
<person>
  <name first="Alan" last="Turing"/>
  <profession value="computer scientist"/>
  <profession value="mathematician"/>
  <profession value="cryptographer"/>
</person>
This raises the question of when and whether one should use child elements or attributes to hold information. This is a subject of heated debate. Some informaticians maintain that attributes are for metadata about the element while elements are for the information itself. Others point out that it's not always so obvious what's data and what's metadata. Indeed, the answer may depend on where the information is put to use.
What's undisputed is that each element may have no more than one attribute with a given name. That's unlikely to be a problem for a birth date or a death date; it would be an issue for a profession, name, address, or anything else of which an element might plausibly have more than one. Furthermore, attributes are quite limited in structure. The value of the attribute is simply undifferentiated text. The division of a date into a year, month, and day with hyphens in the earlier code snippets is at the limits of the substructure that can reasonably be encoded in an attribute. An element-based structure is a lot more flexible and extensible. Nonetheless, attributes are certainly more convenient in some applications. Ultimately, if you're designing your own XML vocabulary, it's up to you to decide when to use which.
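The one-attribute-per-name rule is enforced by the parser itself, as this sketch suggests. It assumes Python's xml.etree.ElementTree, and the profession markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two attributes with the same name on one element: not well-formed.
try:
    ET.fromstring(
        '<person profession="mathematician" profession="cryptographer"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

# Child elements carry no such restriction and may repeat freely.
person = ET.fromstring(
    "<person>"
    "<profession>mathematician</profession>"
    "<profession>cryptographer</profession>"
    "</person>")
professions = [p.text for p in person.findall("profession")]
```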
XML Names
The XML specification can be quite legalistic and picky at times. Nonetheless, it tries to be efficient where possible. One way it does that is by reusing the same rules for different items where possible. For example, the rules for XML element names are also the rules for XML attribute names, as well as for the names of several less common constructs. Collectively, these are referred to simply as XML names.
Element and other XML names may contain essentially any alphanumeric character. This includes the standard English letters A through Z and a through z as well as the digits 0 through 9. XML names may also include non-English letters, numbers, and ideograms, such as ö, ç, and Ω. They may also include these three punctuation characters:
_ The underscore
- The hyphen
. The period
XML names may not contain other punctuation characters such as quotation marks, apostrophes, dollar signs, carets, percent symbols, and semicolons. The colon is allowed, but its use is reserved for namespaces as discussed in Chapter 4. XML names may not contain whitespace of any kind, whether a space, a carriage return, a line feed, a nonbreaking space, and so forth. Finally, all names beginning with the string "XML" (in any combination of case) are reserved for standardization in W3C XML-related specifications.
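A quick way to try these rules out is to let a parser judge candidate names. The sketch below assumes Python's xml.etree.ElementTree; parser_accepts_name is a hypothetical helper, not part of any library. It is a probe of one parser's behavior, not a complete validator, and it avoids colons because that parser also applies namespace rules.

```python
import xml.etree.ElementTree as ET

def parser_accepts_name(name):
    """Hypothetical helper: does the parser accept `name` as an element name?"""
    try:
        root = ET.fromstring(f"<{name}/>")
    except ET.ParseError:
        return False
    # Guard against input that parsed as something other than a bare name.
    return root.tag == name

candidates = ["Record_ID", "first-name", "address.line",
              "price$", "first name", "rate%"]
results = {name: parser_accepts_name(name) for name in candidates}
```

The underscore, hyphen, and period are accepted, while the dollar sign, the space, and the percent symbol are rejected.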
The primary new feature in XML 1.1 is that XML names may contain characters only defined in Unicode 3.0 and later. XML 1.0 is limited to the characters defined as of Unicode 2.0. Additional scripts enabled for names by XML 1.1 include Burmese, Mongolian, Thaana, Cambodian, Yi, and Amharic. (All of these scripts are legal in text content in XML 1.0. You just can't use them to name elements, attributes, and entities.) XML 1.1 offers little to no benefit to developers who don't need to use these scripts in their markup.
XML 1.1 also allows names to contain some uncommon symbols such as the musical symbol for a six-string fretboard and even a million or so code points that aren't actually mapped to particular characters. However, taking advantage of this is highly unwise. We strongly recommend that even in XML 1.1 you limit your names to letters, digits, ideographs, and the specifically allowed ASCII punctuation marks.
References
The character data inside an element must not contain a raw unescaped opening angle bracket (<). This character is always interpreted as beginning a tag. If you need to use this character in your text, you can escape it using the entity reference &lt;, the numeric character reference &#60;, or the hexadecimal numeric character reference &#x3C;. When a parser reads the document, it replaces any &lt;, &#60;, or &#x3C; references it finds with the actual < character. However, it will not confuse the references with the starts of tags. For example:
<SCRIPT LANGUAGE="JavaScript">
  if (location.host.toLowerCase( ).indexOf("ibiblio") &lt; 0) {
    location.href="http://ibiblio.org/xml/";
  }
</SCRIPT>
Character data may not contain a raw unescaped ampersand (&) either. This is always interpreted as beginning an entity reference. However, the ampersand may be escaped using the &amp; entity reference like this:
<company>W.L. Gore &amp; Associates</company>
The ampersand is code point 38 so it could also be written with the numeric character reference &#38;:
<company>W.L. Gore &#38; Associates</company>
Entity references such as &amp; and character references such as &#60; are markup. When an application parses an XML document, it replaces this particular markup with the actual character or characters the reference refers to.
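The replacement of references by the characters they stand for can be observed directly. The following sketch assumes Python's xml.etree.ElementTree for parsing and the standard xml.sax.saxutils.escape function for going the other way; the element name t is arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Three ways of writing the same character in an XML document.
forms = ["&lt;", "&#60;", "&#x3C;"]
resolved = [ET.fromstring(f"<t>{ref}</t>").text for ref in forms]

# The parser replaces &amp; with a plain ampersand.
company = ET.fromstring(
    "<company>W.L. Gore &amp; Associates</company>").text

# When writing XML, escape() produces the references for you.
escaped = escape("W.L. Gore & Associates")
```

By the time the application sees the text, all three spellings of the less-than sign have become the same single character.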
XML predefines exactly five entity references. These are:
&lt;
The less-than sign, a.k.a. the opening angle bracket (<)
&amp;
The ampersand (&)
&gt;
The greater-than sign, a.k.a. the closing angle bracket (>)
&quot;
The straight, double quotation marks (")
&apos;
The apostrophe, a.k.a. the straight single quote (')
Only &lt; and &amp; must be used instead of the literal characters in element content. The others are optional. &quot; and &apos; are useful inside attribute values where a raw " or ' might be misconstrued as ending the attribute value. For example, this image tag uses the
CDATA Sections
When an XML document includes samples of XML or HTML source code, the < and & characters in those samples must be encoded as &lt; and &amp;. The more sections of literal code a document includes and the longer they are, the more tedious this encoding becomes. Instead you can enclose each sample of literal code in a CDATA section. A CDATA section is set off by <![CDATA[ and ]]>. Everything between the <![CDATA[ and the ]]> is treated as raw character data. Less-than signs don't begin tags. Ampersands don't start entity references. Everything is simply character data, not markup.
For example, in a Scalable Vector Graphics (SVG) tutorial written in XHTML, you might see something like this:
<p>You can use a default <code>xmlns</code> attribute to avoid
having to add the svg prefix to all your elements:</p>
<pre><![CDATA[
       <svg xmlns="http://www.w3.org/2000/svg"
            width="12cm" height="10cm">
         <ellipse rx="110" ry="130" />
         <rect x="4cm" y="1cm" width="3cm" height="6cm" />
       </svg>
     ]]></pre>
The SVG source code has been included directly in the XHTML file without carefully replacing each < with &lt;. The result will be a sample SVG document, not an embedded SVG picture, as might happen if this example were not placed inside a CDATA section.
The only thing that cannot appear in a CDATA section is the CDATA section end delimiter, ]]>.
CDATA sections exist for the convenience of human authors, not for programs. Parsers are not required to tell you whether a particular block of text came from a CDATA section, from normal character data, or from character data that contained entity references such as &lt; and &amp;. By the time you get access to the data, these differences will have been washed away. No code you write should depend on the difference between them.
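This washing away of differences is easy to see in practice. The sketch below assumes Python's xml.etree.ElementTree and writes the same text twice, once in a CDATA section and once with entity references.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    '<pre><![CDATA[<ellipse rx="110" ry="130" />]]></pre>').text
with_references = ET.fromstring(
    '<pre>&lt;ellipse rx="110" ry="130" /&gt;</pre>').text

# Both documents deliver exactly the same character data.
indistinguishable = with_cdata == with_references
```

Nothing in the parsed result records which form the author used.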
Comments
XML documents can be commented so that coauthors can leave notes for each other and themselves, documenting why they've done what they've done or items that remain to be done. XML comments are syntactically similar to HTML comments. Just as in HTML, they begin with <!-- and end with the first occurrence of -->. For example:
<!-- I need to verify and update these links when I get a chance. -->
The double hyphen -- must not appear anywhere inside the comment until the closing -->. In particular, a three-hyphen close like ---> is specifically forbidden.
Comments may appear anywhere in the character data of a document. They may also appear before or after the root element. (Comments are not elements, so this does not violate the tree structure or the one-root element rules for XML.) However, comments may not appear inside a tag or inside another comment.
Applications that read and process XML documents may or may not pass along information included in comments. They are certainly free to drop them out if they choose. Do not write documents or applications that depend on the contents of comments being available. Comments are strictly for making the raw source code of an XML document more legible to human readers. They are not intended for computer programs. To pass information to a program, use a processing instruction instead.
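As one concrete case, the default parser in Python's xml.etree.ElementTree discards comments entirely. The sketch below relies on that default behavior; other parsers, or this one configured differently, may keep them. The links and link element names are made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<links>"
    "<!-- I need to verify and update these links when I get a chance. -->"
    "<link>http://ibiblio.org/xml/</link>"
    "</links>")

# Only the element survives; the comment was dropped during parsing.
children = [child.tag for child in root]
serialized = ET.tostring(root, encoding="unicode")
```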
Additional content appearing in this section has been removed.
Processing Instructions
In HTML, comments are sometimes abused to support nonstandard extensions. For instance, the contents of the script element are sometimes enclosed in a comment to protect it from display by a nonscript-aware browser. The Apache web server parses comments in .shtml files to recognize server-side includes. Unfortunately, these documents may not survive being passed through various HTML editors and processors with their comments and associated semantics intact. Worse yet, it's possible for an innocent comment to be misconstrued as input to the application.
XML provides the processing instruction as an alternative means of passing information to particular applications that may read the document. A processing instruction begins with <? and ends with ?>. Immediately following the <? is an XML name called the target, possibly the name of the application for which this processing instruction is intended or possibly just an identifier for this particular processing instruction. The rest of the processing instruction contains text in a format appropriate for the applications for which the instruction is intended.
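The split between target and data is visible through a DOM parser. The sketch below uses Python's standard xml.dom.minidom module; the cache target and its contents are hypothetical, invented only for this illustration.

```python
from xml.dom import minidom

# A hypothetical processing instruction placed before the root element.
doc = minidom.parseString('<?cache max-age="60" scope="page"?><doc/>')

# The first child of the document node is the processing instruction.
pi = doc.firstChild
print(pi.target)  # the XML name right after <?
print(pi.data)    # everything else, handed over as one uninterpreted string
```

Note that the parser does not break the data into name-value pairs. Anything that looks like an attribute inside a processing instruction is for the receiving application to interpret.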
For example, in HTML, a robots META tag is used to tell search-engine and other robots whether and how they should index a page. The following processing instruction has been proposed as an equivalent for XML documents:
<?robots index="yes" follow="no"?>
The target of this processing instruction is robots. The syntax of this particular processing instruction is two pseudo-attributes, one named index and one named follow, whose values are either yes or no. The semantics of this particular processing instruction are that if the index attribute has the value yes, then search-engine robots should index this page. If index has the value no, then robots should not index the page. Similarly, if follow has the value yes, then links from this document will be followed; if it has the value
Additional content appearing in this section has been removed.
The XML Declaration
XML documents should (but do not have to) begin with an XML declaration. The XML declaration looks like a processing instruction with the name xml and with version, standalone, and encoding pseudo-attributes. Technically, it's not a processing instruction, though; it's just the XML declaration, nothing more, nothing less. Example 2-7 demonstrates.
Example 2-7. A very simple XML document with an XML declaration
<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<person>
  Alan Turing
</person>
XML documents do not have to have an XML declaration. However, if an XML document does have an XML declaration, then that declaration must be the first thing in the document. It must not be preceded by any comments, whitespace, processing instructions, and so forth. The reason is that an XML parser uses the first five characters (<?xml) to make some reasonable guesses about the encoding, such as whether the document uses a single-byte or multibyte character set. The only thing that may precede the XML declaration is an invisible Unicode byte-order mark. We'll discuss this further in Chapter 5.
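The "first thing in the document" rule is easy to trip over with a stray blank line. A minimal sketch with Python's standard xml.etree.ElementTree module follows; the helper name parses is ours.

```python
import xml.etree.ElementTree as ET

DECL = '<?xml version="1.0"?>'
BODY = "<person>Alan Turing</person>"

def parses(text):
    """Return True if the parser accepts text, False on a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

declaration_first = parses(DECL + BODY)
after_blank_line = parses("\n" + DECL + BODY)
after_comment = parses("<!-- note -->" + DECL + BODY)
no_declaration = parses(BODY)

print(declaration_first, after_blank_line, after_comment, no_declaration)
```

Only the first and last documents are accepted: the declaration is either at the very start or absent.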
The version attribute should have the value 1.0. Under very unusual circumstances, it may also have the value 1.1. Since specifying version="1.1" limits the document to the most recent versions of only a couple of parsers, and since all XML 1.1 parsers must also support XML 1.0, you don't want to casually set the version to 1.1.
Don't believe us? First answer a couple of questions:
  1. Do you speak Cambodian, Burmese, Amharic, Mongolian, or Divehi?
  2. Does your data contain obsolete, nontext C0 control characters such as vertical tab, form feed, or bell?
If you answered no to both of these questions, you have absolutely nothing to gain by using XML 1.1. If you answered yes to either one, then you may have cause to use XML 1.1. XML 1.0 allows Cambodian, Burmese, Amharic, etc. to be used in character data and attribute values. XML 1.1 also allows these scripts to be used in element and attribute names, which XML 1.0 does not. XML 1.1 also allows C0 control characters (except null) to be used in character data and attribute values (provided they're escaped as numeric character references like
Additional content appearing in this section has been removed.
Checking Documents for Well-Formedness
Every XML document, without exception, must be well-formed. This means it must adhere to a number of rules, including the following:
  1. Every start-tag must have a matching end-tag.
  2. Elements may nest but may not overlap.
  3. There must be exactly one root element.
  4. Attribute values must be quoted.
  5. An element may not have two attributes with the same name.
  6. Comments and processing instructions may not appear inside tags.
  7. No unescaped < or & signs may occur in the character data of an element or attribute.
This is not an exhaustive list. There are many, many ways a document can be malformed. You'll find a complete list in Chapter 21. Some of these involve constructs that we have not yet discussed, such as DTDs. Others are extremely unlikely to occur if you follow the examples in this chapter (for example, including whitespace between the opening < and the element name in a tag).
Whether the error is small or large, likely or unlikely, an XML parser reading a document is required to report it. It may or may not report multiple well-formedness errors it detects in the document. However, the parser is not allowed to try to fix the document and make a best-faith effort of providing what it thinks the author really meant. It can't fill in missing quotes around attribute values, insert an omitted end-tag, or ignore the comment that's inside a start-tag. The parser is required to return an error. The objective here is to avoid the bug-for-bug compatibility wars that plagued early web browsers and continue to this day. Consequently, before you publish an XML document—whether that document is a web page, input to a database, or something else—you'll want to check it for well-formedness.
The simplest way to do this is by loading the document into a web browser that understands XML documents, such as Mozilla. If the document is well-formed, the browser will display it. If it isn't, then it will show an error message.
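If you would rather check from a script than from a browser, any XML parser will do. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name check_well_formed and the sample documents are ours, and each malformed sample breaks one of the rules listed earlier.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return None if text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

samples = {
    "well-formed": '<person born="1912">Alan Turing</person>',
    "missing end-tag": "<person>Alan Turing",
    "overlapping elements": "<a><b></a></b>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<person born=1912/>",
    "duplicate attribute": '<person born="1912" born="1913"/>',
    "unescaped ampersand": "<firm>AT&T</firm>",
}
for label, text in samples.items():
    print(label, "->", check_well_formed(text) or "OK")
```

The parser stops at the first error it finds, so a document with several problems may need several passes.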
Additional content appearing in this section has been removed.
Chapter 3: Document Type Definitions (DTDs)
While XML is extremely flexible, not all the programs that read particular XML documents are so flexible. Many programs can work with only some XML applications but not others. For example, Adobe Illustrator can read and write Scalable Vector Graphics (SVG) files, but you wouldn't expect it to understand a Platform for Privacy Preferences (P3P) document. And within a particular XML application, it's often important to ensure that a given document adheres to the rules of that XML application. For instance, in XHTML, li elements should only be children of ul or ol elements. Browsers may not know what to do with them, or may act inconsistently, if li elements appear in the middle of a blockquote or p element.
XML 1.0 provides a solution to this dilemma: a document type definition (DTD). DTDs are written in a formal syntax that explains precisely which elements may appear where in the document and what the elements' contents and attributes are. A DTD can make statements such as "A ul element only contains li elements" or "Every employee element must have a social_security_number attribute." Different XML applications can use different DTDs to specify what they do and do not allow.
A validating parser compares a document to its DTD and lists any places where the document differs from the constraints specified in the DTD. The program can then decide what it wants to do about any violations. Some programs may reject the document. Others may try to fix the document or reject just the invalid element. Validation is an optional step in processing XML. A validity error is not necessarily a fatal error like a well-formedness error, although some applications may choose to treat it as one.
Additional content appearing in this section has been removed.
Validation
A valid document includes a document type declaration that identifies the DTD that the document satisfies. The DTD lists all the elements, attributes, and entities the document uses and the contexts in which it uses them. The DTD may list items the document does not use as well. Validity operates on the principle that everything not permitted is forbidden. Everything in the document must match a declaration in the DTD. If a document has a document type declaration and the document satisfies the DTD that the document type declaration indicates, then the document is said to be valid. If it does not, it is said to be invalid.
There are many things the DTD does not say. In particular, it does not say the following:
  • What the root element of the document is
  • How many instances of each kind of element appear in the document
  • What the character data inside the elements looks like
  • The semantic meaning of an element; for instance, whether it contains a date or a person's name
DTDs allow you to place some constraints on the form an XML document takes, but there can be quite a bit of flexibility within those limits. A DTD never says anything about the length, structure, meaning, allowed values, or other aspects of the text content of an element or attribute.
Validity is optional. A parser reading an XML document may or may not check for validity. If it does check for validity, the program receiving data from the parser may or may not care about validity errors. In some cases, such as feeding records into a database, a validity error may be quite serious, indicating that a required field is missing, for example. In other cases, rendering a web page perhaps, a validity error may not be so important, and a program can work around it. Well-formedness is required of all XML documents; validity is not. Your documents and your programs can use validation as you find needful.
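To make the distinction concrete, here is a sketch with Python's standard xml.etree.ElementTree module, which is built on the nonvalidating Expat parser. The person and name elements are hypothetical. The document is well-formed but invalid, because its DTD requires a name child that is missing, and the parser accepts it without complaint.

```python
import xml.etree.ElementTree as ET

# Well-formed, but invalid: the DTD says person must contain a name element.
invalid_but_well_formed = """<!DOCTYPE person [
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
]>
<person>Alan Turing</person>"""

# A nonvalidating parser reads the document without checking it against the DTD.
root = ET.fromstring(invalid_but_well_formed)
print(root.tag, "->", root.text)
```

A program that needs the DTD enforced has to use a validating parser; the standard library module used here will not do it.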
Additional content appearing in this section has been removed.
Element Declarations
Every element used in a valid document must be declared in the document's DTD with an element declaration. Element declarations have this basic form:
<!ELEMENT name content_specification>
The name of the element can be any legal XML name. The content specification indicates what children the element may or must have and in what order. Content specifications can be quite complex. They can say, for example, that an element must have three child elements of a given type, or two children of one type followed by another element of a second type, or any elements chosen from seven different types interspersed with text.
The simplest content specification is one that says an element may only contain parsed character data, but may not contain any child elements of any type. In this case the content specification consists of the keyword #PCDATA inside parentheses. For example, this declaration says that a phone_number element may contain text but may not contain elements:
<!ELEMENT phone_number (#PCDATA)>
Such an element may also contain character references and CDATA sections (which are always parsed into pure text), as well as comments and processing instructions (which don't really count in validation). It may contain entity references only if those entity references resolve to plain text without any child elements.
Another simple content specification is one that says the element must have exactly one child of a given type. In this case, the content specification consists of the name of the child element inside parentheses. For example, this declaration says that a fax element must contain exactly one phone_number element:
<!ELEMENT fax (phone_number)>
A fax element may not contain anything else except the phone_number element, and it may not contain more or less than one of those.
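The two declarations above can be placed in a document's internal DTD subset together with an instance that satisfies them. The sketch below uses Python's standard xml.parsers.expat module, whose ElementDeclHandler callback reports each element declaration the parser reads. The phone number is made up, and Expat only reports the declarations; it does not validate the instance against them.

```python
import xml.parsers.expat

declared = []

def on_element_decl(name, model):
    # model is Expat's description of the content specification;
    # this sketch records only the declared element name.
    declared.append(name)

parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse("""<!DOCTYPE fax [
  <!ELEMENT phone_number (#PCDATA)>
  <!ELEMENT fax (phone_number)>
]>
<fax><phone_number>617-555-1212</phone_number></fax>""", True)

print(declared)
```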
In practice, a content specification that lists exactly one child element is rare. Most elements contain either parsed character data or (at least potentially) multiple child elements. The simplest way to indicate multiple child elements is to separate them with commas. This is called a
Additional content appearing in this section has been removed.
Attribute Declarations
In addition to declaring its elements, a valid document must declare all the elements' attributes. This is done with ATTLIST declarations. A single ATTLIST can declare multiple attributes for a single element type. However, if the same attribute is repeated on multiple elements, then it must be declared separately for each element where it appears. (Later in this chapter you'll see how to use parameter entity references to make this repetition less burdensome.)
For example, this ATTLIST declaration declares the source attribute of the image element:
<!ATTLIST image source CDATA #REQUIRED>
It says that the image element has an attribute named source. The value of the source attribute is character data, and instances of the image element in the document are required to provide a value for the source attribute.
A single ATTLIST declaration can declare multiple attributes for the same element. For example, this ATTLIST declaration not only declares the source attribute of the image element, but also the width, height, and alt attributes:
<!ATTLIST image source CDATA #REQUIRED
                width  CDATA #REQUIRED
                height CDATA #REQUIRED
                alt    CDATA #IMPLIED
>
This declaration says the source, width, and height attributes are required. However, the alt attribute is optional and may be omitted from particular image elements. All four attributes are declared to contain character data, the most generic attribute type.
This declaration has the same effect and meaning as four separate ATTLIST declarations, one for each attribute. Whether to use one ATTLIST declaration per attribute is a matter of personal preference, but most experienced DTD designers prefer the multiple-attribute form. Given judicious application of whitespace, it's no less legible than the alternative.
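One way to confirm what a multiple-attribute ATTLIST declares is to let a parser report it. This sketch uses the AttlistDeclHandler callback of Python's standard xml.parsers.expat module, which fires once per declared attribute. The EMPTY content model for image and the attribute values in the instance are our assumptions, added only to make a complete document.

```python
import xml.parsers.expat

required = {}

def on_attlist_decl(elname, attname, type_, default, is_required):
    # Called once for every attribute named in an ATTLIST declaration.
    required[(elname, attname)] = bool(is_required)

parser = xml.parsers.expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist_decl
parser.Parse("""<!DOCTYPE image [
  <!ELEMENT image EMPTY>
  <!ATTLIST image source CDATA #REQUIRED
                  width  CDATA #REQUIRED
                  height CDATA #REQUIRED
                  alt    CDATA #IMPLIED
  >
]>
<image source="bus.jpg" width="152" height="345"/>""", True)

for (element, attribute), needed in required.items():
    print(element, attribute, "required" if needed else "optional")
```

The single declaration produces four separate reports, the same as four one-attribute ATTLIST declarations would.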
In merely well-formed XML, attribute values can be any string of text. The only restrictions are that any occurrences of
Additional content appearing in this section has been removed.
General Entity Declarations
As you learned in Chapter 2, XML predefines five entities for your convenience:
&lt;
The less-than sign, a.k.a. the opening angle bracket (<)
&amp;
The ampersand (&)
&gt;
The greater-than sign, a.k.a. the closing angle bracket (>)
&quot;
The straight, double quotation marks (")
&apos;
The apostrophe, a.k.a. the straight single quote (')
The DTD can define many more entities. This is useful not just in valid documents, but even in documents you don't plan to validate.
Entity references are defined with an ENTITY declaration in the DTD. This gives the name of the entity, which must be an XML name, and the replacement text of the entity. For example, this entity declaration defines &super; as an abbreviation for supercalifragilisticexpialidocious:
<!ENTITY super "supercalifragilisticexpialidocious">
Once that's done, you can use &super; anywhere you'd normally have to type the entire word (and probably misspell it).
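Here is the &super; entity at work in a complete document, as a minimal sketch using Python's standard xml.etree.ElementTree module. The song root element is hypothetical. The parser replaces the reference with the declared text before the program ever sees it.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE song [
  <!ENTITY super "supercalifragilisticexpialidocious">
]>
<song>It's &super;!</song>"""

# The entity reference is expanded during parsing.
root = ET.fromstring(doc)
print(root.text)
```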
Entities can contain markup as well as text. For example, this declaration defines &footer; as an abbreviation for a standard web page footer that will be repeated on many pages:
<!ENTITY footer '<hr size="1" noshade="true"/>
<font CLASS="footer">
<a href="index.html">O&apos;Reilly Home</a> |
<a href="sales/bookstores/">O&apos;Reilly Bookstores</a> |
<a href="order_new/">How to Order</a> |
<a href="oreilly/contact.html">O&apos;Reilly Contacts</a><br/>
<a href="http://international.oreilly.com/">International</a> |
<a href="oreilly/about.html">About O&apos;Reilly</a> |
<a href="affiliates.html">Affiliated Companies</a>
</font>
<p>
<font CLASS="copy">
Copyright 2004, O&apos;Reilly Media, Inc.<br/>
<a href="mailto:webmaster@oreilly.com">webmaster@oreilly.com</a>
</font>
</p>
'>
The entity replacement text must be well-formed. For instance, you cannot put a start-tag in one entity and the corresponding end-tag in another entity.
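Both halves of that rule can be tried out with a much smaller footer. This sketch uses Python's standard xml.etree.ElementTree module; the page element, the shortened footer text, and the open and close entities are all hypothetical. The first document uses an entity whose markup is balanced. The second splits a start-tag and its end-tag across two entities.

```python
import xml.etree.ElementTree as ET

balanced = """<!DOCTYPE page [
  <!ENTITY footer '<p>Copyright 2004 <em>Example Press</em></p>'>
]>
<page>&footer;</page>"""

# The markup in the replacement text becomes real elements in the tree.
root = ET.fromstring(balanced)
tags = [element.tag for element in root.iter()]

split_across_entities = """<!DOCTYPE page [
  <!ENTITY open '<p>'>
  <!ENTITY close '</p>'>
]>
<page>&open;text&close;</page>"""

try:
    ET.fromstring(split_across_entities)
    split_accepted = True
except ET.ParseError:
    split_accepted = False

print(tags, split_accepted)
```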
The other thing you have to be careful about is that you need to use different quote marks inside the replacement text from the ones that delimit it. Here we've chosen single quotes to surround the replacement text and double quotes internally. However, we did have to change the single quote in "O'Reilly" to the predefined general entity reference
Additional content appearing in this section has been removed.
External Parsed General Entities
The footer example is about at the limits of what you can comfortably fit in a DTD. In practice, web sites prefer to store repeated content like this in external files and load it into their pages using PHP, server-side includes, or some similar mechanism. XML supports this technique through external general entity references, although in this case the client, rather than the server, is responsible for integrating the different pieces of the document into a coherent whole.
An external parsed general entity reference is declared in the DTD using an ENTITY declaration. However, instead of the actual replacement text, the SYSTEM keyword and a URL to the replacement text are given. For example:
<!ENTITY footer SYSTEM "http://www.oreilly.com/boilerplate/footer.xml">
Of course, a relative URL will often be used instead. For example:
<!ENTITY footer SYSTEM "/boilerplate/footer.xml">
In either case, when the general entity reference &footer; is seen in the character data of an element, the parser may replace it with the document found at http://www.oreilly.com/boilerplate/footer.xml. References to external parsed entities are not allowed in attribute values. Most of the time this shouldn't be too big a hassle because attribute values tend to be small enough to be easily included in internal entities.
Notice we wrote that the parser may replace the entity reference with the document at the URL, not that it must. This is an area where parsers have some leeway in just how much of the XML specification they wish to implement. A validating parser must retrieve such an external entity. However, a nonvalidating parser may or may not choose to retrieve the entity.
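With a nonvalidating parser, the decision is often left to the application. As one illustration, Python's standard xml.parsers.expat module calls an ExternalEntityRefHandler callback when it meets a reference to an external entity, and the application chooses what to do. The sketch below records the request and deliberately fetches nothing; the page root element is hypothetical.

```python
import xml.parsers.expat

requested = []

def on_external_entity(context, base, system_id, public_id):
    # A real application could fetch and parse system_id here.
    # This sketch only records the request and skips the entity.
    requested.append(system_id)
    return 1  # a nonzero result tells Expat to continue parsing

parser = xml.parsers.expat.ParserCreate()
parser.ExternalEntityRefHandler = on_external_entity
parser.Parse("""<!DOCTYPE page [
  <!ENTITY footer SYSTEM "/boilerplate/footer.xml">
]>
<page>&footer;</page>""", True)

print(requested)
```

The document parses successfully even though the footer was never retrieved, so a program cannot assume the entity's content is present in what it receives.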
Furthermore, not all text files can serve as external entities. In order to be loaded in by a general entity reference, the document must be potentially well-formed when inserted into an existing document. This does not mean the external entity itself must be well-formed. In particular, the external entity might not have a single root element. However, if such a root element were wrapped around the external entity, then the resulting document should be well-formed. This means, for example, that all elements that start inside the entity must finish inside the same entity. They cannot finish inside some other entity. Furthermore, the external entity does not have a prolog and, therefore, cannot have an XML declaration or a document type declaration.
Additional content appearing in this section has been removed.
External Unparsed Entities and Notations
Not all data is XML. There are a lot of ASCII text files in the world that don't give two cents about escaping < as &lt; or adhering to the other constraints by which an XML document is limited. There are probably even more JPEG photographs, GIF line art, QuickTime movies, MIDI sound files, and so on. None of these are well-formed XML, yet all of them are necessary components of many documents.
The mechanism that XML suggests for embedding these things in documents is the external unparsed entity. The DTD specifies a name and a URL for the entity containing the non-XML data. For example, this ENTITY declaration associates the name turing_getting_off_bus with the JPEG image at http://www.turing.org.uk/turing/pi1/busgroup.jpg:
<!ENTITY turing_getting_off_bus
         SYSTEM "http://www.turing.org.uk/turing/pi1/busgroup.jpg"
         NDATA jpeg>
Since the data in the previous code is not in XML format, the NDATA declaration specifies the type of the data. Here the name jpeg is used. XML does not recognize this as meaning an image in a format defined by the Joint Photographic Experts Group. Rather this is the name of a notation declared elsewhere in the DTD using a NOTATION declaration like this:
<!NOTATION jpeg SYSTEM "image/jpeg">
Here we've used the MIME media type image/jpeg as the external identifier for the notation. However, there is absolutely no standard or even a suggestion for exactly what this identifier should be. Individual applications must define their own requirements for the contents and meaning of notations.
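A parser passes these declarations along to the application without interpreting them. The sketch below shows this with the NotationDeclHandler and UnparsedEntityDeclHandler callbacks of Python's standard xml.parsers.expat module. The doc root element is hypothetical, and nothing is downloaded; the parser only reports the names and identifiers.

```python
import xml.parsers.expat

notations = {}
unparsed = {}

def on_notation(name, base, system_id, public_id):
    # Record the external identifier given for each notation.
    notations[name] = system_id

def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    # Record where each unparsed entity lives and which notation it uses.
    unparsed[name] = (system_id, notation_name)

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse("""<!DOCTYPE doc [
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!ENTITY turing_getting_off_bus
           SYSTEM "http://www.turing.org.uk/turing/pi1/busgroup.jpg"
           NDATA jpeg>
]>
<doc/>""", True)

print(notations)
print(unparsed)
```

What to do with a notation such as image/jpeg is entirely up to the application that receives it.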
The DTD only declares the existence, location, and type of the unparsed entity. To actually include the entity in the document at one or more locations, you insert an element with an ENTITY type attribute whose value is the name of an unparsed entity declared in the DTD. You do not use an entity reference like &turing_getting_off_bus;
Additional content appearing in this section has been removed.
Parameter Entities