Learning XML, 2nd Edition

What Is XML?

XML is a lot like the ubiquitous plastic containers of Tupperware®. There is really no better way to keep your food fresh than with those colorful, airtight little boxes. They come in different sizes and shapes so you can choose the one that fits best. They lock tight so you know nothing is leaking out and germs can’t get in. You can tell items apart based on the container’s color, or even scribble on it with magic marker. They’re stackable and can be nested in larger containers (in case you want to take them with you on a picnic). Now, if you think of information as a precious commodity like food, then you can see the need for a containment system like Tupperware®.

An Information Container

XML contains, shapes, labels, structures, and protects information. It does this with symbols embedded in the text, called markup. Markup enhances the meaning of information in certain ways, identifying the parts and how they relate to each other. For example, when you read a newspaper, you can tell articles apart by their spacing and position on the page and the use of different fonts for titles and headings. Markup works in a similar way, except that instead of spaces and lines, it uses symbols.

Markup is important to electronic documents because they are processed by computer programs. If a document has no labels or boundaries, then a program will not know how to distinguish a piece of text from any other piece. Essentially, the program would have to work with the entire document as a unit, severely limiting the interesting things you can do with the content. A newspaper with no space between articles and only one text style would be a huge, uninteresting blob of text. You could probably figure out where one article ends and another starts, but it would be a lot of work. A computer program wouldn’t be able to do even that, since it lacks all but the most rudimentary pattern-matching skills.

XML’s markup divides a document into separate information containers called elements. Like Tupperware® containers, they seal up the data completely, label it, and provide a convenient package for computer processing. Like boxes, elements nest inside other elements. One big element may contain a whole bunch of elements, which in turn contain other elements, and so on down to the data. This creates an unambiguous hierarchical structure that preserves all kinds of ancillary information: sequence, ownership, position, description, association. An XML document consists of one outermost element that contains all the other elements, plus some optional administrative information at the top.

Example 1-1 is a typical XML document containing a short telegram. Take a moment to dissect it with your eyes and then we’ll walk through it together.

Example 1-1. An XML document

<?xml version="1.0"?>
<!DOCTYPE telegram SYSTEM "/xml-resources/dtds/telegram.dtd">
<telegram pri="important">
  <to>Sarah Bellum</to>
  <from>Colonel Timeslip</from>
  <subject>Robot-sitting instructions</subject>
  <graphic fileref="figs/me.eps"/>
  <message>Thanks for watching my robot pal 
    <name>Zonky</name> while I'm away. 
    He needs to be recharged <emphasis>twice a
    day</emphasis> and if he starts to get cranky, 
    give him a quart of oil. I'll be back soon, 
    after I've tracked down that evil 
    mastermind <villain>Dr. Indigo Riceway</villain>.
  </message>
</telegram>

Can you tell the difference between the markup and the data? The markup symbols are delineated by angle brackets (<>). <to> and </villain> are two such symbols, called tags. The data, or content, fills the space between these tags. As you get used to looking at XML, you’ll use the tags as signposts to navigate visually through documents.

At the top of the document is the XML declaration, <?xml version="1.0"?>. It tells an XML-processing program which version of XML the document uses and, when an encoding attribute is included, which character encoding the document is stored in, so the processor can get started on the document. It is optional, but a good thing to include in a document.
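
To see why the declaration matters to a parser, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The name element and its content are made up for illustration; the point is only that the same bytes are read correctly when the encoding is declared and rejected when it is not.

```python
import xml.etree.ElementTree as ET

# A made-up element whose content ends in e-acute, stored as the
# single Latin-1 byte 0xE9.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Zo\xe9</name>'
undeclared = b'<name>Zo\xe9</name>'

# With the declaration, the parser knows how to decode the bytes.
name_text = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8, where a lone 0xE9 byte is invalid.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(ascii(name_text), undeclared_ok)
```

A document saved as UTF-8 would parse either way, which is one reason the declaration is often treated as optional in practice.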

After that comes the document type declaration, containing a reference to a grammar-describing document, located on the system in the file /xml-resources/dtds/telegram.dtd. That grammar document is known as a document type definition (DTD). <!DOCTYPE...> is one example of a type of markup called a declaration. Declarations are used to constrain grammar and declare pieces of text or resources to be included in the document. This line isn’t required unless you want a parser to validate your document’s structure against a set of rules you provide in the DTD.

Next, we see the <telegram> tag. This is the start of an element. We say that the element’s name or type (not to be confused with a data type) is “telegram,” or you could just call it a “telegram element.” The end of the element is at the bottom and is represented by the tag </telegram> (note the slash at the beginning). This element contains all of the contents of the document. No wonder, then, that we call it the document element. (It is also sometimes called the root element.) Inside, you’ll see more elements with start tags and end tags following a similar pattern.

There is one exception here, the empty tag <graphic.../>, which represents an empty element. Rather than containing data, this element references some other information that should be used in its place, in this case a graphic to be displayed. Empty elements do not mark boundaries around text and other elements the way container elements do, but they still may convey positional information. For example, you might place the graphic inside a mixed-content element, such as the message element in the example, to place the graphic at that position in the text.

Every element that contains data has to have both a start tag and an end tag; an element with no content can instead use the empty form used for graphic. (It’s okay to use a start tag immediately followed by an end tag for an empty element; the empty tag is effectively an abbreviation of that.) The names in start and end tags have to match exactly, even down to the case of the letters. XML is very picky about details like this. This pickiness ensures that the structure is unambiguous and the data is airtight. If start tags or end tags were optional, the computer (or even a human reader) wouldn’t know where one element ended and another began, causing problems with parsing.
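
You can watch this pickiness in action with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper function is_well_formed is my own name, not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

matched = is_well_formed("<to>Sarah Bellum</to>")
wrong_case = is_well_formed("<to>Sarah Bellum</To>")   # names differ in case
no_end_tag = is_well_formed("<to>Sarah Bellum")        # end tag left off

# The empty tag and the start/end pair describe the same element.
short_form = ET.fromstring('<graphic fileref="figs/me.eps"/>')
long_form = ET.fromstring('<graphic fileref="figs/me.eps"></graphic>')

def summary(element):
    return (element.tag, element.attrib, element.text, len(element))

same_element = summary(short_form) == summary(long_form)

print(matched, wrong_case, no_end_tag, same_element)
```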

From this example, you can see a pattern: some tags function as bookends, marking the beginning and ending of regions, while others mark a place in the text. Even the simple document here contains quite a lot of information:

Boundaries: A piece of text starts in one place and ends in another. The tags <telegram> and </telegram> define the start and end of a collection of text and markup.
Roles: What is a region of text doing in the document? Here, the tags <name> and </name> give an obvious purpose to the content of the element: a name, as opposed to any other kind of inline text such as a date or emphasis.
Positions: Elements preserve the order of their contents, which is especially important in prose documents like this.
Containment: The nesting of elements is taken into account by XML-processing software, which may treat content differently depending on where it appears. For example, a title might have a different font size depending on whether it’s the title of a newspaper or an article.
Relationships: A piece of text can be linked to a resource somewhere else. For instance, the tag <graphic.../> creates a relationship (link) between the XML fragment and a file named me.eps. The intent is to import the graphic data from the file and display it in this fragment.
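
To make these kinds of information concrete, here is a small sketch that loads a trimmed copy of the telegram with Python's standard xml.etree.ElementTree module and reads back its structure. The outline function is a name I made up for this illustration, and the DOCTYPE line is left out because nothing here validates the document.

```python
import xml.etree.ElementTree as ET

TELEGRAM = """<telegram pri="important">
  <to>Sarah Bellum</to>
  <from>Colonel Timeslip</from>
  <subject>Robot-sitting instructions</subject>
  <graphic fileref="figs/me.eps"/>
  <message>Thanks for watching my robot pal
    <name>Zonky</name> while I'm away.
    I'll be back soon, after I've tracked down that evil
    mastermind <villain>Dr. Indigo Riceway</villain>.
  </message>
</telegram>"""

def outline(element, depth=0):
    """Yield (depth, name) pairs in document order."""
    yield depth, element.tag
    for child in element:
        yield from outline(child, depth + 1)

root = ET.fromstring(TELEGRAM)

# Boundaries, positions, and containment: each element knows where it
# sits, what comes before it, and what it is nested inside.
structure = list(outline(root))
for depth, name in structure:
    print("  " * depth + name)

# Roles: pick out content by the name of its container.
villain = root.find("message/villain").text

# Relationships: the empty element points at an outside resource.
graphic_file = root.find("graphic").get("fileref")
```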

An important XML term to understand is document. When you hear that word, you probably think of a sequence of words partitioned into paragraphs, sections, and chapters, comprising a human-readable record such as a book, article, or essay. But in XML, a document is even more general: it’s the basic unit of XML information, composed of elements and other markup in an orderly package. It can contain text such as a story or article, but it doesn’t have to. Instead, it might consist of a database of numbers, or some abstract structure representing a molecule or equation. In fact, one of the most promising applications of XML is as a format for application-to-application data exchange. Keep in mind that an XML document can have a much wider definition than what you might think of as a traditional document. The following are short examples of documents.

The Mathematical Markup Language (MathML) encodes equations. A well-known equation among physicists is Newton’s Law of Gravitation: F = GMm / r². The document in Example 1-2 represents that equation.

Example 1-2. A MathML document

<?xml version="1.0"?>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>F</mi>
  <mo>=</mo>
  <mi>G</mi>
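  <!-- &#x2062; is the invisible-times character; a numeric reference
       needs no DTD to declare it -->
  <mo>&#x2062;</mo>
  <mfrac>
    <mrow>
      <mi>M</mi>
      <mo>&#x2062;</mo>
      <mi>m</mi>
    </mrow>
    <!-- denominator: r squared, written in presentation markup -->
    <msup>
      <mi>r</mi>
      <mn>2</mn>
    </msup>
  </mfrac>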
  </mfrac>
</math>

While one application might use this input to display the equation, another might use it to solve the equation with a series of values. That’s a sign of XML’s power.
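
As a rough illustration of two programs sharing one document, the sketch below reads a trimmed version of the equation (the invisible multiplication operators are left out, and the exponent is written with msup) using Python's standard xml.etree.ElementTree module. One pass lists the variables; another flattens the markup into a line of plain text. The helper names local_name and to_text are my own, and the handful of element types handled here is only what this one equation needs.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

EQUATION = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>F</mi>
  <mo>=</mo>
  <mi>G</mi>
  <mfrac>
    <mrow><mi>M</mi><mi>m</mi></mrow>
    <msup><mi>r</mi><mn>2</mn></msup>
  </mfrac>
</math>"""

def local_name(element):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return element.tag.split("}", 1)[-1]

def to_text(element):
    """Flatten the small subset of MathML used above into plain text."""
    name = local_name(element)
    if name in ("mi", "mo", "mn"):
        return element.text
    parts = [to_text(child) for child in element]
    if name == "mfrac":
        return "(" + parts[0] + ")/(" + parts[1] + ")"
    if name == "msup":
        return parts[0] + "^" + parts[1]
    return "".join(parts)  # math and mrow just group their children

root = ET.fromstring(EQUATION)

# "Application" one: which variables does the equation mention?
identifiers = [el.text for el in root.iter("{%s}mi" % MATHML_NS)]

# "Application" two: a crude one-line rendering.
rendering = to_text(root)

print(identifiers)
print(rendering)
```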

You can also store graphics in XML documents. The Scalable Vector Graphics (SVG) language is used to draw resizable line art. The document in Example 1-3 defines a picture with three shapes (a rectangle, a circle, and a polygon).

Example 1-3. An SVG document

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg 
   PUBLIC "-//W3C//DTD SVG 20001102//EN" 
   "http://www.w3.org/TR/2000/CR-SVG-20001102/DTD/svg-20001102.dtd">
<svg>
  <desc>Three shapes</desc>
  <rect fill="green" x="1cm" y="1cm" width="3cm" height="3cm"/>
  <circle fill="red" cx="3cm" cy="2cm" r="4cm"/>
  <polygon fill="blue" points="110,160 50,300 180,290"/>
</svg>

It’s also worth noting that a document is not necessarily the same as a file. A file is a package of data treated as a contiguous unit by the computer’s operating system. This is called a physical structure. An XML document can exist in one file or in many files, some of which may be on another system. It may not be in a file at all, but generated in a stream from a program. XML uses special markup to integrate the contents of different files to create a single entity, which we describe as a logical structure. By keeping a document independent of the restrictions of a file, XML facilitates a linked web of document parts which can reside anywhere.
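
Here is a minimal sketch of a document that never touches a file. A generator stands in for some program emitting XML on the fly, and Python's standard xml.etree.ElementTree parser is fed the pieces as they arrive. The function generate_telegram and its contents are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def generate_telegram(recipient):
    """Stand-in for a program that emits XML piece by piece."""
    yield "<telegram>"
    # escape() protects any <, >, or & that might appear in the name
    yield "<to>" + escape(recipient) + "</to>"
    yield "<message>Back soon.</message>"
    yield "</telegram>"

# Feed the parser one chunk at a time; no file is involved.
parser = ET.XMLParser()
for chunk in generate_telegram("Sarah Bellum"):
    parser.feed(chunk)
root = parser.close()

print(root.tag, root.find("to").text)
```

To the code that receives root, it makes no difference whether the document came from a disk, a network connection, or another program.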

That’s XML markup in a nutshell. The whole of the next chapter is devoted to this topic. There, we’ll go into deeper detail about the picky rules and describe some new components you haven’t seen yet. You’ll then be able to tear apart any XML document and know what all the pieces are for, and put together documents of your own.

A Markup Language Toolkit

Strictly speaking, XML is not a markup language. A language has a fixed vocabulary and grammar, but XML doesn’t actually define any elements. Instead, it lays down a foundation of syntactic constraints on which you can build your own language. So a more apt description might be to call XML a markup language toolkit. When you need a markup language, you can build one quickly using XML’s rules, and you’ll be comfortable knowing that it will automatically be compatible with all the generic XML tools out there.

The telegram in Example 1-1 is marked up in a language I invented for fun. I chose a bunch of element names that I thought were important to represent a typical telegram, and a couple that were gratuitously silly, like villain. This is okay, because the language is for my use and I can do whatever I want with it. Perhaps I have something in mind for the villain element, like printing it in a different color to stand out. The point is that XML gives me the ability to tailor a markup language any way I want, which is a very powerful feature.

Well-formedness

Because XML doesn’t have a predetermined vocabulary, it’s possible to invent a markup language as you go along. Perhaps in a future telegram I want to identify a new kind of thing with an element I’ve never used before. Say I wrote to a friend inviting her to a party, and I enclosed the date in an element called, appropriately, date. Free-form XML, as I like to call it, is perfectly legal as long as it’s well-formed. In other words, as long as you spell tags correctly, use both start tags and end tags, and obey all the other minimal rules, it’s good XML.

Documents that follow the syntax rules of XML are well-formed XML documents. This piece of text would fail the test on three counts:

<equation<a < b<equation>

Can you find all the problems? First, the start tag is spelled incorrectly, because it has two left brackets instead of a left and a right. Second, there is a left bracket in the content of the element, which is illegal. Third, the end tag of the element is missing a slash. This is not well-formed XML. Any program that parses it should stop at the first error and refuse to have anything more to do with it.
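
If you want to try this yourself, the sketch below hands the broken text, and one possible repair of it, to the parser in Python's standard xml.etree.ElementTree module. The repaired version is my own guess at what the author of the fragment meant.

```python
import xml.etree.ElementTree as ET

broken = "<equation<a < b<equation>"
repaired = "<equation>a &lt; b</equation>"

try:
    ET.fromstring(broken)
    broken_error = None
except ET.ParseError as err:
    # The parser gives up at the first problem it meets.
    broken_error = err

# In the repaired version the left bracket in the content is written
# as the &lt; reference, which the parser turns back into "<".
content = ET.fromstring(repaired).text

print(broken_error)
print(content)
```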

Well-formedness is XML’s “purity test.” What does this get us? Compatibility. It allows you to write a program or a library of routines that know nothing about the incoming data except that it will be well-formed XML. An XML editor could be used to edit any XML document, a browser to view any document, and so on. Programs are more robust and less complex when the data is more consistent.

Validity

Some programs are not so general-purpose, however. They may perform complex operations on highly specific data. In this case, you may need to concretize your markup language so that a user doesn’t slip in an unexpected element type and confuse the program. What you need is a formal document model. A document model is the blueprint for an instance of a markup language. It gives you an even stricter test than well-formedness, so you can say that Document X is not just well-formed XML, but it’s also an instance of the Mathematical Markup Language, for example.

When a document instance matches a document model, we say that it is valid. You may hear it phrased as, “this is valid XHTML” or “valid SVG.” The markup languages (e.g., XHTML and SVG) are applications of XML. Today, there are hundreds of XML applications for encoding everything from plays to chemical formulae. If you’re in the market for a markup language, chances are you’ll find one that meets your needs. If not, you can always make your own. That’s the power of XML.

There are several ways to define a markup language formally. The two most common are document type definitions (DTDs) and schemas. Each has its strong points and weak points.

Document type definitions (DTDs)

DTDs are built into the XML 1.0 specification. They are usually separate documents that your document can refer to, although parts of DTDs can also reside inside your document. A DTD is a collection of rules, or declarations, describing elements and other markup objects. An element declaration adds a new element type to the vocabulary and defines its content model, what the element can contain and in which order. Any element type not declared in the DTD is illegal. Any element containing something not declared in the DTD is also illegal. The DTD doesn’t restrict what kind of data can go inside elements, which is the primary flaw of this kind of document model.
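
The idea behind these two rules can be sketched in a few lines of Python. To be clear about the assumptions: this is not a DTD and not a real validator (the standard xml.etree.ElementTree module used here does not validate at all). The table of allowed children is a vocabulary I made up for the telegram, and unlike a true content model it ignores order and repetition.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: each declared element and the children it may hold.
ALLOWED_CHILDREN = {
    "telegram": {"to", "from", "subject", "graphic", "message"},
    "to": set(),
    "from": set(),
    "subject": set(),
    "graphic": set(),
    "message": {"name", "emphasis", "villain"},
    "name": set(),
    "emphasis": set(),
    "villain": set(),
}

def violations(element):
    """List the ways an element tree strays from the vocabulary above."""
    problems = []
    if element.tag not in ALLOWED_CHILDREN:
        problems.append("undeclared element: " + element.tag)
    for child in element:
        if child.tag not in ALLOWED_CHILDREN.get(element.tag, set()):
            problems.append(child.tag + " not allowed inside " + element.tag)
        problems.extend(violations(child))
    return problems

good = ET.fromstring(
    "<telegram><to>Sarah</to><message>Hi <name>Zonky</name></message></telegram>")
bad = ET.fromstring(
    "<telegram><to>Sarah</to><date>2003-09-01</date></telegram>")

good_problems = violations(good)
bad_problems = violations(bad)
print(good_problems)
print(bad_problems)
```

Notice that the checker has nothing to say about the text inside date; it can only object that the element was never declared.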

Schemas

Schemas are a later invention, offering more flexibility and a way to specify patterns for data, which is absent from DTDs. For example, in a schema you could declare an element called date and then require that it contains a legal date in the format YYYY-MM-DD. With DTDs the best you could do is say whether the element can contain characters or elements. Unfortunately, there is a lot of controversy around schemas because different groups have put forth competing proposals. Perhaps there will always be different types of schemas, which is fine with me.
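
To give a feel for the kind of test a schema adds, here is a sketch in plain Python rather than in any schema language. The function is_legal_date and the invitation document are invented for the example; a schema processor would perform an equivalent check for you once the date element is declared with a date type.

```python
import re
from datetime import date
import xml.etree.ElementTree as ET

# Four digits, two digits, two digits, separated by hyphens.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def is_legal_date(text):
    """True if text is a real calendar date written as YYYY-MM-DD."""
    match = DATE_PATTERN.fullmatch(text or "")
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects month 13, February 30, and so on
    except ValueError:
        return False
    return True

invitation = ET.fromstring(
    "<invitation>Party on <date>2003-02-30</date></invitation>")
date_ok = is_legal_date(invitation.find("date").text)

print(date_ok)
print(is_legal_date("2003-09-01"), is_legal_date("next Friday"))
```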

An Open Standard

As Andrew Tanenbaum, a famous networking researcher, once said, “The wonderful thing about standards is that there are so many of them.” We’ve all felt a little bewildered by all the new standards that support the information infrastructure. But the truth is, standards work, and without them the world would be a much more confusing place. From Eli Whitney’s interchangeable gun parts to standard railroad gauges, the industrial revolution couldn’t have happened without them.

The best kind of standard is one that is open. An open standard is not owned by any single company or individual. It is designed and maintained based on input from the community to fit real needs, not to satisfy a marketing agenda. Therefore, it isn’t subject to change without notice, nor is it tied to the fortune of a company that could disappear in the next market downturn. There are no licensing fees, nondisclosure agreements, partnerships, or intellectual property disputes. It’s free, public, and completely transparent.

The Internet was largely built upon open standards. IP, TCP, ASCII, email, HTML, Telnet, FTP—they are all open, even if they were funded by private and government organizations. Developers like open standards because they can have a say in how they are designed. They are free to use what works for them, rather than be tied to a proprietary package. And history shows that they work remarkably well.

XML is an open standard. It was designed by a group of companies, organizations, and individuals called the World Wide Web Consortium (W3C). The current recommendation was published in 1998, with a second edition published in 2000, although a new version (1.1, which modifies the list of allowable characters) is currently in the draft stage. The specification is free to the public, on the web at http://www.w3.org/TR/REC-xml. As a recommendation, it isn’t strictly binding. There is no certification process, but developers are motivated to comply as closely as possible to attract customers and community approval.

In one sense, a loosely binding recommendation is useful, in that standards enforcement takes time and resources that no one in the consortium wants to spend. It also allows developers to create their own extensions, or to make partially working implementations that do a pretty good job. The downside, however, is that there’s no guarantee anyone will do a really good job. For example, the Cascading Style Sheets standard has languished for years because browser manufacturers couldn’t be bothered to fully implement it. Nevertheless, the standards process is generally a democratic and public-focused process, which is a Good Thing.

The W3C has taken on the role of the unofficial smithy of the Web. Founded in 1994 by a number of organizations and companies around the world with a vested interest in the Web, the consortium has the long-term goal of researching and fostering accessible and superior web technology with responsible application. It helps to banish the chaos of competing, half-baked technologies by issuing technical documents and recommendations to software vendors and end users alike.

Every recommendation that goes up on the W3C’s web site must endure a long, tortuous process of proposals and revisions before it’s finally ratified by the organization’s Advisory Committee. A recommendation begins as a project, or activity, when somebody sends the W3C Director a formal proposal called a briefing package. If approved, the activity gets its own working group with a charter to start development work. The group quickly nails down details such as filling leadership positions, creating the meeting schedule, and setting up necessary mailing lists and web pages.

At regular intervals, the group issues reports of its progress, posted to a publicly accessible web page. Such a working draft does not necessarily represent a finished work or consensus among the members, but is rather a progress report on the project. People in the community are welcome to review it and make comments. Developers start to implement parts of the proposed technology to test it out, finding problems in the process. Software vendors press for more features. All this feedback is important to ensure work is going in the right direction and nothing important has been left out, particularly when the last call working draft is out.

The draft then becomes a candidate recommendation. At this stage, the working group members are satisfied that the ideas are essentially sound and no major changes will be needed. Experts will continue to weigh in with their insights, mostly addressing details and small mistakes. The deadline for comments finally arrives and the working group goes back to work, making revisions and changes.

Satisfied that the group has something valuable to contribute to the world, the Director takes the candidate recommendation and blesses it into a proposed recommendation. It must then survive the scrutiny of the Advisory Committee and perhaps be revised a little more before it finally graduates into a recommendation.

The whole process can take years to complete, and until the final recommendation is released, you shouldn’t accept anything as gospel. Everything can change overnight as the next draft is posted, and many a developer has been burned by implementing the sketchy details in a working draft, only to find that the actual recommendation is a completely different beast. If you’re an end user, you should also be careful. You may believe that the feature you need is coming, only to find it was cut from the feature list at the last minute.

It’s a good idea to visit the W3C’s web site (http://www.w3.org) every now and then. You’ll find news and information about evolving standards, links to tutorials, and pointers to software tools. It’s listed, along with some other favorite resources, in Appendix B.

A Constellation of Standards

Many people agree that spending money is generally more fun than saving it. Sure, you can get a little thrill looking at your bank statement and seeing the dividend from the 3% interest on your savings account, but it isn’t as exciting as buying a new plasma screen television. So it is with XML. It contains information like a safe holds money, but the real fun comes from using that information. Whether you’re publishing an XHTML document to the Web or generating an image from SVG, the results are much more gratifying than staring at markup.

XML’s extended family provides many ways to squeeze usefulness out of XML documents. They are extensions and applications of XML that build bridges to other formats or make it easier to work with data. All the names and acronyms may be a little overwhelming at first, but it’s worth getting to know this growing family.

The family falls into a handful of categories. Let’s look at them in more detail.

Core syntax: These are the minimal standards required to understand XML. They include the core recommendation and its extension, Namespaces in XML. The latter piece allows you to classify markup in different groups. One use of this is to combine markup from different XML applications in the same document. The core syntax of XML will be covered thoroughly in Chapter 2.
Human documents: This category has markup languages for documents you’ll actually read, as opposed to raw data. XHTML, the XML-friendly upgrade to the Hypertext Markup Language, is used to encode web pages. DocBook is for technical manuals which are heavy in technical terms and complex structures like tables, lists, and sidebars. The Wireless Markup Language (WML) is somewhat like XHTML but specializes in packaging documents for tiny screens on cellular phones. We will discuss this narrative style of document in Chapter 3.
Modeling: In this group are all the technologies developed to create models of documents that formalize a markup language and can be used to test document instances against standard grammars. These include DTDs (part of the core XML 1.0 recommendation), the W3C’s XML Schema, RELAX NG, and Schematron, all of which will be covered in Chapter 4.
Locating and linking: Data is only as useful as it is easy to access. That’s why there is a slew of protocols available for getting to data deep inside documents. XPath provides a language for specifying the path to a piece of data. XPointer and XLink use these paths to create a link from one document to another. XInclude imports files into a document. The XML Query Language (XQuery), still in drafts, creates an XML interface for non-XML data sources, essentially turning databases into XML documents. We will explore XPath and XPointer in Chapter 6.
Presentation: XML isn’t very pretty to look at directly. If you want to make it presentable, you need to use a stylesheet. The two most popular are Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL). The former is very simple and fine for most online documents. The latter is highly detailed and better for print-quality documents. CSS is the topic of Chapter 5. We will take two chapters to talk about XSL: Chapter 7 and Chapter 8.
Media: Not all data is meant to be read. The Scalable Vector Graphics language (SVG) creates images and animations. The Synchronized Multimedia Integration Language (SMIL) scripts graphic, sound, and text events in a timeline-based multimedia presentation. VoiceXML describes how to turn text into speech and script interactions with humans.
Science: Scientific applications have been early adopters of XML. The Chemical Markup Language (CML) represents molecules in XML, while MathML builds equations. Software turns instances of these markup languages into the nicely rendered visual representations that scientists are accustomed to viewing.
Resource description: With so many documents now online, we need ways to sort through them all to find just the information we need. Resource description is a way of summarizing and showing relationships between documents. The Resource Description Framework (RDF) is a language for describing resources.
Communication: XML is an excellent way for different systems to communicate with each other. Interhost system calls are being standardized through applications like XML-RPC, SOAP, WSDL, and UDDI. XML Signatures ensures security in identification by encoding unique, verifiable signatures for documents of any kind. SyncML is a way to transfer data from a personal computer to a smaller device like a cellular phone, giving you a fast and dependable way to update address lists and calendars.
Transformation: Converting between one format and another is a necessary fact of life. If you’ve ever had to import a document from one software application into another, you know that it can sometimes be a messy task. Extensible Stylesheet Language Transformations (XSLT) can automate the task for you. It turns one form of XML into another in a process called transformation. It is essentially a programming language, but optimized for traversing and building XML trees. Transformation is the topic of Chapter 7.
Development: When all else fails, you can always fall back on programming. Most programming languages have support for parsing and navigating XML. They frequently make use of two standard interfaces. The Simple API for XML (SAX) is very popular for its simplicity and efficiency. The Document Object Model (DOM) outlines an interface for moving around an object tree of a document for more complex processing. Programming with XML will be the last topic visited in this book, in Chapter 10.
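
As a small taste of the two interfaces named under Development, here is a sketch using the SAX and DOM implementations that ship with Python (xml.sax and xml.dom.minidom). The telegram is a shortened copy of Example 1-1, and the NameCounter class is a name of my own choosing.

```python
import xml.sax
from xml.dom import minidom

TELEGRAM = (b'<telegram pri="important">'
            b'<to>Sarah Bellum</to>'
            b'<from>Colonel Timeslip</from>'
            b'<message>Tracking <villain>Dr. Indigo Riceway</villain>.</message>'
            b'</telegram>')

class NameCounter(xml.sax.ContentHandler):
    """SAX style: react to each start tag as the parser streams past it."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

handler = NameCounter()
xml.sax.parseString(TELEGRAM, handler)

# DOM style: build the whole tree first, then move around inside it.
document = minidom.parseString(TELEGRAM)
villain = document.getElementsByTagName("villain")[0]
villain_name = villain.firstChild.data
parent_name = villain.parentNode.tagName

print(handler.counts)
print(villain_name, "is inside", parent_name)
```

The SAX handler never holds more than one event at a time, which is where its efficiency comes from; the DOM tree costs more memory but lets you walk from any node to its parent, children, or siblings.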