Java and XML, 3rd Edition

Chapter 1. Introduction

In the next two chapters, I’m going to give you a crash course in XML and constraints. Since there is so much material available on XML and related specifications, I’d rather cruise through this material quickly and get on to Java. If you are completely new to XML, you might want to have a few of the following books around as reference:

XML in a Nutshell, by Elliotte Rusty Harold and W. Scott Means

Learning XML, by Erik Ray

Learning XSLT, by Michael Fitzgerald

XSLT, by Doug Tidwell

These are all O’Reilly books, and I have them scattered about my own workspace. With that said, let’s dive in.

XML 1.0

It all begins with the XML 1.0 Recommendation, which you can read in its entirety at http://www.w3.org/TR/REC-xml. Example 1-1 shows an XML document that conforms to this specification. I’ll use it to illustrate several important concepts.

Example 1-1. A typical XML document is long and verbose

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:dc="http://purl.org/dc/elements/1.1/" 
         xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" 
         xmlns:l="http://purl.org/rss/1.0/modules/link/" 
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <!--Generated by Blogger v5.0-->
  <channel rdf:about="http://www.neilgaiman.com/journal/journal.asp">
    <title>Neil Gaiman's Journal</title>
    <link>http://www.neilgaiman.com/journal/journal.asp</link>
    <description>Neil Gaiman's Journal</description>
    <dc:date>2005-04-30T01:57:38Z</dc:date>
    <dc:language>en-US</dc:language>
    <admin:generatorAgent rdf:resource="http://www.blogger.com/" />
    <admin:errorReportsTo rdf:resource="mailto:rss-errors@blogger.com" />
    <items>
      <rdf:Seq>
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/jetlag-morning.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/demon-days.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/more-from-mailbag.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/two-days.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp" />
      </rdf:Seq>
    </items>
  </channel>

  <!-- and so on... -->
</rdf:RDF>

Tip

For those of you who are curious, this is the RSS feed for Neil Gaiman’s blog (http://www.neilgaiman.com). It uses a lot of RSS syntax, which I’ll cover in Chapter 12 in detail.

A lot of this specification describes what is mostly intuitive. If you’ve done any HTML authoring, or SGML, you’re already familiar with the concept of elements (such as items and channel in Example 1-1) and attributes (such as resource and about). XML defines how to use these items and how a document must be structured. XML spends more time defining tricky issues like whitespace than introducing any concepts that you’re not at least somewhat familiar with. One exception may be that some of the elements in Example 1-1 are in the form:

[prefix]:[element name]

An example is rdf:li. These are elements in an XML namespace, something I’ll explain in detail shortly.

An XML document can be broken into two basic pieces: the header, which gives an XML parser and XML applications information about how to handle the document, and the content, which is the XML data itself. Although this is a fairly loose division, it helps us differentiate the instructions to applications within an XML document from the XML content itself, and is an important distinction to understand. The header begins with the XML declaration, in this format:

<?xml version="1.0" encoding="UTF-8"?>

This header includes an encoding, and can also indicate whether the document is a standalone document or requires other documents to be referenced for a complete understanding of its meaning:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The rest of the header is made up of items like the DOCTYPE declaration (not included in the example):

<!DOCTYPE RDF SYSTEM "DTDs/RDF-gaiman.dtd">

In this case, the declaration refers to a file on the local system, in the directory DTDs/ called RDF-gaiman.dtd. Any time you use a relative or absolute file path or a URL, you want to use the SYSTEM keyword. The other option is using the PUBLIC keyword, and following it with a public identifier. This means that the W3C or another consortium has defined a standard DTD that is associated with that public identifier. As an example, take the DTD statement for XHTML 1.0:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Here, a public identifier is supplied (the funny little string starting with -//), followed by a system identifier (the URL). If the public identifier cannot be resolved, the system identifier is used instead.

You may also see processing instructions at the top of a file, and they are generally considered part of a document’s header, rather than its content. They look like this:

<?xml-stylesheet href="XSL/JavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSL/JavaXML.wml.xsl" type="text/xsl" 
                 media="wap"?>
<?cocoon-process type="xslt"?>

Each is considered to have a target (the first word, like xml-stylesheet or cocoon-process) and data (the rest). Often, the data is in the form of name-value pairs, which can really help readability. This is only a good practice, though, and not required, so don’t depend on it.
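To make the target/data split concrete, here is a minimal sketch that reads a processing instruction through the DOM API bundled with the JDK. The class name ProcessingInstructionDemo, the method firstInstruction, and the page root element are all made up for illustration; only the xml-stylesheet instruction comes from the text above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

public class ProcessingInstructionDemo {

    // Returns {target, data} for the first processing instruction that
    // appears at the top level of the document, or null if there is none.
    // (Hypothetical helper, written only for this illustration.)
    public static String[] firstInstruction(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList topLevel = doc.getChildNodes();
            for (int i = 0; i < topLevel.getLength(); i++) {
                Node node = topLevel.item(i);
                if (node.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) node;
                    return new String[] { pi.getTarget(), pi.getData() };
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // "page" is a placeholder root element, not part of the book's examples
        String xml =
            "<?xml version=\"1.0\"?>"
            + "<?xml-stylesheet href=\"XSL/JavaXML.html.xsl\" type=\"text/xsl\"?>"
            + "<page/>";
        String[] parts = firstInstruction(xml);
        System.out.println("target: " + parts[0]);
        System.out.println("data:   " + parts[1]);
    }
}
```

Notice that the parser hands back the data as one opaque string; if you want the individual name-value pairs, splitting them apart is your application’s job.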

Other than that, the bulk of your XML document should be content; in other words, elements, attributes, and data that you have put into it.

The Root Element

The root element is the highest-level element in the XML document, and must be the first opening tag and the last closing tag within the document. It provides a reference point that enables an XML parser or XML-aware application to recognize a beginning and end to an XML document. In Example 1-1, the root element is RDF:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:dc="http://purl.org/dc/elements/1.1/" 
         xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" 
         xmlns:l="http://purl.org/rss/1.0/modules/link/" 
         xmlns:content="http://purl.org/rss/1.0/modules/content/">

    <!-- Document content -->
</rdf:RDF>

This tag and its matching closing tag surround all other data content within the XML document. XML specifies that there may be only one root element in a document. In other words, the root element must enclose all other elements within the document. Aside from this requirement, a root element does not differ from any other XML element. It’s important to understand this, because XML documents can reference and include other XML documents. In these cases, the root element of the referenced document becomes an enclosed element in the referring document and must be handled normally by an XML parser. Defining root elements as standard XML elements without special properties or behavior allows document inclusion to work seamlessly.

Elements

So far, I have glossed over defining an actual element. Let’s take an in-depth look at elements, which are represented by arbitrary names and must be enclosed in angle brackets. There are several different variations of elements in the sample document, as shown here:

  <!-- Standard element opening tag -->
  <items>

  <!-- Standard element with attribute -->
  <rdf:li 
    rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp">

  <!-- Element with textual data -->
  <dc:creator>Neil Gaiman</dc:creator>

  <!-- Empty element -->
  <l:permalink l:type="text/html" 
      rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp"
  />

  <!-- Standard element closing tag -->
  </items>

Tip

This isn’t actual XML; it’s just a collection of examples. Trying to parse something like this would fail, as there are opening tags without corresponding closing tags.

The first rule in creating elements is that their names must start with a letter or underscore, and then may contain any number of letters, numbers, underscores, hyphens, or periods. They may not contain embedded spaces:

<!-- Embedded spaces are not allowed -->
<my element name>

XML element names are also case-sensitive. Generally, using the same rules that govern Java variable naming will result in sound XML element naming. Using an element named tcbo to represent Telecommunications Business Object is not a good idea because it is cryptic, while an overly verbose tag name like beginningOfNewChapter just clutters up a document. Keep in mind that your XML documents will probably be seen by other developers and content authors, so clear documentation through good naming is essential.
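If you want a quick sanity check on candidate names, the rule can be approximated with a regular expression. This is only a sketch under a loud assumption: it covers ASCII names as described above, while the real XML grammar also admits many non-ASCII characters, plus colons (which are reserved for namespace prefixes). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class ElementNameCheck {

    // ASCII-only approximation of the naming rule: a letter or underscore
    // first, then any number of letters, digits, underscores, hyphens, or
    // periods. This is NOT the full XML Name production.
    private static final Pattern NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_.-]*");

    public static boolean looksLikeValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeValidName("items"));           // true
        System.out.println(looksLikeValidName("my element name")); // false
        System.out.println(looksLikeValidName("2ndChapter"));      // false
    }
}
```

A real parser remains the final authority; a check like this is only handy for catching obvious mistakes early.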

Every opened element must in turn be closed. There are no exceptions to this rule as there are in many other markup languages, like HTML. An ending element tag consists of the forward slash and then the element name: </items>. Between an opening and closing tag, there can be any number of additional elements or textual data. However, you cannot mix the order of nested tags; the first opened element must always be the last closed element. If any of the rules for XML syntax are not followed in an XML document, the document is not well-formed. A well-formed document is one in which all XML syntax rules are followed, and all elements and attributes are correctly positioned. However, a well-formed document is not necessarily valid; a valid document is one that also follows the constraints set upon it by its DTD or schema. There is a significant difference between a well-formed document and a valid one; the rules I discuss in this section ensure that your document is well-formed, while the rules discussed in Chapter 2 ensure that your document is valid.

As an example of a document that is not well-formed, consider this XML fragment:

<tag1>
 <tag2>
</tag1>
 </tag2>

The order of nesting of tags is incorrect, as the opened <tag2> is not followed by a closing </tag2> within the surrounding tag1 element. However, even if these syntax errors are corrected, there is still no guarantee that the document will be valid.

While this example of a document that is not well-formed may seem trivial, remember that this would be acceptable HTML, and commonly occurs in large tables within an HTML document. In other words, HTML and many other markup languages do not require well-formed XML documents. XML’s strict adherence to ordering and nesting rules allows data to be parsed and handled much more quickly than when using markup languages without these constraints.
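You can watch a parser enforce these nesting rules with a few lines of Java. The following is a minimal sketch, assuming only the JAXP classes that ship with the JDK; the class WellFormedCheck and its isWellFormed method are hypothetical names. It checks well-formedness only and says nothing about validity.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Returns true if a non-validating parser accepts the text,
    // false if it reports a well-formedness error.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // The badly nested fragment from the text
        System.out.println(isWellFormed("<tag1><tag2></tag1></tag2>")); // false
        // The same tags, correctly nested
        System.out.println(isWellFormed("<tag1><tag2></tag2></tag1>")); // true
    }
}
```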

The last rule I’ll look at is the case of empty elements. I already said that XML tags must always be paired; an opening tag and a closing tag constitute a complete XML element. There are cases where an element is used purely by itself, like a flag stating a chapter is incomplete, or where an element has attributes but no textual data, like an image declaration in HTML. These would have to be represented as:

<admin:generatorAgent rdf:resource="http://www.blogger.com/">
</admin:generatorAgent>

<img src="/images/xml.gif"></img>

This is obviously a bit silly, and adds clutter to what can often be very large XML documents. The XML specification provides a means to signify both an opening and closing element tag within one element:

<admin:generatorAgent rdf:resource="http://www.blogger.com/" />
<img src="/images/xml.gif" />

You may have noticed the space before the end slash in these empty elements and wondered why it’s there. Well, let me tell you. I’ve had the unfortunate pleasure of working with Java and XML since late 1998, when things were rough at best. And some web browsers at that time (and some today, to be honest) would only accept XHTML (HTML that is well-formed) in very specific formats. Most notably, tags like <br> that are never closed in HTML must be closed in XHTML, resulting in <br/>. Some of these browsers would completely ignore a tag like this; however, oddly enough, they would happily process <br /> (note the space before the end slash). I got used to making my XML not only well-formed, but consumable by these browsers. I’ve never had a good reason to change these habits, so you get to see them in action here.

This nicely solves the problem of unnecessary clutter, and still follows the rule that every XML element must have a matching end tag; it simply consolidates both start and end tag into a single tag.
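To convince yourself that the two spellings really are the same element, you can parse both and compare the results. This sketch assumes the DOM Level 3 isEqualNode() method available in the JDK’s DOM implementation; EmptyElementDemo and sameElement are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class EmptyElementDemo {

    // Parses a one-element document and returns its root element.
    private static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    // True if both strings parse to structurally equal elements
    // (same name, same attributes, same children).
    public static boolean sameElement(String first, String second) {
        return rootOf(first).isEqualNode(rootOf(second));
    }

    public static void main(String[] args) {
        String paired = "<img src=\"/images/xml.gif\"></img>";
        String consolidated = "<img src=\"/images/xml.gif\" />";
        System.out.println(sameElement(paired, consolidated)); // true
    }
}
```

The space before the slash makes no difference to an XML parser either; it only matters to the picky browsers I mentioned.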

Attributes

In addition to text contained within an element’s tags, an element can also have attributes. Attributes are included with their respective values within the element’s opening declaration (which can also be its closing declaration!). For example, in the channel element, a URL for information about the channel is noted in an attribute:

<channel rdf:about="http://www.neilgaiman.com/journal/journal.asp">

In this example, rdf:about is the attribute name; the value is the URL, "http://www.neilgaiman.com/journal/journal.asp". Attribute names must follow the same rules as XML element names, and attribute values must be within quotation marks. Although both single and double quotes are allowed, double quotes are a widely used standard and result in XML documents that model Java programming practices.

In addition to learning how to use attributes, there is an issue of when to use attributes. Because XML allows such a variety of data formatting, it is rare that an attribute cannot be represented by an element, or that an element could not easily be converted to an attribute. Although there’s no specification or widely accepted standard for determining when to use an attribute and when to use an element, there is a good rule of thumb: use elements for multiple-valued data and attributes for single-valued data.

If data can have multiple values, or is very lengthy, the data most likely belongs in an element. It can then be treated primarily as textual data, and is easily searchable and usable. Examples are the description of a book’s chapters, or URLs detailing related links from a site.

However, if the data is primarily represented as a single value, it is best represented by an attribute. A good candidate for an attribute is the section of a chapter; while the section item itself might be an element and have its own title, the grouping of chapters within a section could be easily represented by a section attribute within the chapter element. This attribute would allow easy grouping and indexing of chapters, but would never be directly displayed to the user.

Another good example of a piece of data that could be represented in XML as an attribute is if a particular table or chair is on layaway. This instruction could let an XML application used to generate a brochure or flyer know to not include items on layaway in current stock; obviously this is a true or false value, and has only a singular value at any time. Again, the application client would never directly see this information, but the data would be used in processing and handling the XML document. If after all of this analysis you are still unsure, you can always play it safe and use an element.

Namespaces

Note the use of namespaces in the root element of Example 1-1:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:dc="http://purl.org/dc/elements/1.1/" 
         xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" 
         xmlns:l="http://purl.org/rss/1.0/modules/link/" 
         xmlns:content="http://purl.org/rss/1.0/modules/content/">

An XML namespace is a means of associating one or more elements in an XML document with a particular URI. This means that the element is identified by both its name and its namespace URI. In many complex XML documents, the same XML name (for example, author) may need to be used in different ways. For instance, in the example, there is an author for the RSS feed, as well as an author for each journal entry. While both of these pieces of data fit nicely into an element named author, they should not be taken as the same type of data.

The XML namespaces specification nicely solves this problem. The namespace specification requires that a unique URI be associated with a prefix to distinguish the elements in one namespace from elements in other namespaces. So you could assign a URI of http://www.neilgaiman.com/entries, and associate it with the prefix journal, for use by journal-specific elements. You could then assign another URI, like http://www.w3.org/1999/02/22-rdf-syntax-ns#, and a prefix of rss, for RSS-specific elements:

<rdf:RDF xmlns:rss="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:journal="http://www.neilgaiman.com/entries">

Now you can use those prefixes in your XML:

<rss:author>Doug Hally</rss:author>
<journal:author>Neil Gaiman</journal:author>

Tip

You can actually use a namespace prefix on the same element where that namespace is declared. For example, this is perfectly legal XML:

<rss:author xmlns:rss="http://www.w3.org/1999/02/22-rdf-syntax-ns#">Doug Hally</rss:author>

An XML parser can now easily distinguish these two different types of author; as an added benefit, the XML is a lot more human-readable now.
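Here is roughly what that distinction looks like from Java. It’s a minimal sketch using a namespace-aware DOM parser from the JDK; the feed wrapper element, the class NamespaceDemo, and the method authorIn are invented for illustration, while the two URIs and author values are the ones used above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceDemo {

    public static final String RSS_NS =
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    public static final String JOURNAL_NS =
        "http://www.neilgaiman.com/entries";

    // "feed" is a placeholder root element for this sketch
    public static final String SAMPLE =
        "<feed xmlns:rss=\"" + RSS_NS + "\""
        + " xmlns:journal=\"" + JOURNAL_NS + "\">"
        + "<rss:author>Doug Hally</rss:author>"
        + "<journal:author>Neil Gaiman</journal:author>"
        + "</feed>";

    // Returns the text of the first author element in the given
    // namespace, or null if no such element exists.
    public static String authorIn(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without this, the parser treats "rss:author" as a plain name
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList matches = doc.getElementsByTagNameNS(namespaceUri, "author");
            return matches.getLength() == 0
                ? null
                : matches.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(authorIn(SAMPLE, RSS_NS));     // Doug Hally
        System.out.println(authorIn(SAMPLE, JOURNAL_NS)); // Neil Gaiman
    }
}
```

The lookup is keyed on the URI and the local name author, not on the prefix; the prefix is just shorthand within the document.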

Entity References

One item I have not discussed is escaping characters, or referring to other constant type data values. For example, a common way to represent a path to an installation directory in online documentation is <path-to-Ant> or <TOMCAT_HOME>. Here, the user would replace the text with the appropriate choice of installation directory. In the following journal entry, there are several HTML tags within the entry itself:

When the shoot was done, my daughter Holly, who had been doing her 
homework in the room next door, and occasionally coming out to laugh 
at me, helped use up the last few pictures on the roll. She looks like 
she's having fun. I think I look a little dazed.<br /><br />
<img src="http://www.neilgaiman.com/journal/Neil_8313036.jpg" ><br />
<br />This is the one we're going to be using on the book jacket of 
ANANSI BOYS.

The problem is that XML parsers attempt to handle these bits of data (<br /> and <img>) as XML tags. This is a common problem, as any use of angle brackets results in this behavior. Entity references provide a way to overcome this problem. An entity reference is a special data type in XML used to refer to another piece of data. The entity reference consists of a unique name, preceded by an ampersand and followed by a semicolon: &[entity name];. When an XML parser sees an entity reference, the specified substitution value is inserted, and no processing of that value occurs. XML defines five entities to address the problem discussed in the example: &lt; for the less-than bracket, &gt; for the greater-than bracket, &amp; for the ampersand sign itself, &quot; for a double quotation mark, and &apos; for a single quotation mark or apostrophe. Using these special references, the entry can contain the HTML tags without having them interpreted as XML tags by the XML parser:

When the shoot was done, my daughter Holly, who had been doing her 
homework in the room next door, and occasionally coming out to laugh 
at me, helped use up the last few pictures on the roll. She looks like 
she's having fun. I think I look a little dazed.&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://www.neilgaiman.com/journal/Neil_8313036.jpg" 
/&gt;&lt;br /&gt;&lt;br /&gt;This is the one we're going to be using 
on the book jacket of ANANSI BOYS.

Once this document is parsed, the data is interpreted as normal HTML br and img tags, and the document is still considered well-formed.
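A short sketch shows the substitution at work: the parser hands your application the literal angle brackets, but never treats them as markup. The entry wrapper element and the names EntityReferenceDemo, textOf, and elementCount are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityReferenceDemo {

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // The text of the root element, with entity references already replaced.
    public static String textOf(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // How many elements with the given name the parser actually saw.
    public static int elementCount(String xml, String name) {
        return parse(xml).getElementsByTagName(name).getLength();
    }

    public static void main(String[] args) {
        // "entry" is a placeholder element wrapping the escaped journal text
        String xml =
            "<entry>I think I look a little dazed.&lt;br /&gt;&lt;br /&gt;</entry>";
        System.out.println(textOf(xml));
        // prints: I think I look a little dazed.<br /><br />
        System.out.println(elementCount(xml, "br")); // 0
    }
}
```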

Also be aware that entity references are user-definable. This allows a sort of shortcut markup; for example, you might want to reference a copyright notice online somewhere. Because the copyright is used for multiple books and articles, it doesn’t make sense to include the actual text within hundreds of different XML documents; however, if the copyright is changed, all referring XML documents should reflect the changes:

<ora:copyright>&OReillyCopyright;</ora:copyright>

Although you won’t see how the XML parser is told what to reference when it sees &OReillyCopyright; until the next chapter, you need to realize that there are more uses for entity references than just representing difficult or unusual characters within data.

Unparsed Data

The last XML construct to look at is the CDATA section marker. A CDATA section is used when a significant amount of data should be passed on to the calling application without any XML parsing. It is used when an unusually large number of characters would have to be escaped using entity references, or when spacing must be preserved. In an XML document, a CDATA section looks like this:

<content:encoded><![CDATA[Lot of flying yesterday and now I'm home again. 
For a day. Last night's useful post was written, but was eaten by weasels. 
Next week is the last week of <em>Beowulf-</em>with-Avary-and-Zemeckis work 
for a long while, and then I get to be home for about a month, if you 
don't count the trip to New York for Book Expo, and right now I just 
like the idea of sleeping in my own bed for a couple of nights running.
<br /><br /> </p>]]></content:encoded>

In this example, the information within the CDATA section does not have to use entity references or other mechanisms to alert the parser that reserved characters are being used; instead, the XML parser passes them unchanged to the wrapping program or application.
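One way to see what “passes them unchanged” means is to compare a CDATA section with the same text written out using entity references; the application should receive identical character data either way. This is a minimal sketch: the encoded element stands in for content:encoded (without its namespace prefix, to keep the example short), and CDataDemo and textOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class CDataDemo {

    // The character data of the root element, as the application sees it.
    public static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Same text twice: once in a CDATA section, once escaped by hand
        String withCData =
            "<encoded><![CDATA[<em>Beowulf-</em>with-Avary-and-Zemeckis<br />]]></encoded>";
        String withEntities =
            "<encoded>&lt;em&gt;Beowulf-&lt;/em&gt;with-Avary-and-Zemeckis&lt;br /&gt;</encoded>";

        System.out.println(textOf(withCData));
        // prints: <em>Beowulf-</em>with-Avary-and-Zemeckis<br />
        System.out.println(textOf(withCData).equals(textOf(withEntities))); // true
    }
}
```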

At this point, you have seen the major components of XML documents. Although each has only been briefly described, this should give you enough information to recognize the parts of an XML document when you see them and know their general purpose.

XML 1.1

In February of 2004, the XML 1.1 specification was released by the World Wide Web Consortium (W3C; http://www.w3.org). If you don’t recall hearing much about XML 1.1, it’s no surprise; XML 1.1 was largely about Unicode conformance, and really didn’t affect XML as a whole that much, particularly for document authors and programmers not working with unusual character sets.

While XML was undergoing fairly minor maintenance updates, Unicode moved from Version 2.0 to 4.0. Since XML relies on Unicode for the characters allowed in XML element and attribute names, this had a ripple effect on document authors who wanted to use the new Unicode 4.0 characters in their documents. In XML 1.0, the specification had to explicitly permit characters to be in element and attribute names; as a result, new characters in later versions of Unicode were excluded for name usage by parsers. In XML 1.1—in an effort to avoid similar problems in the future—characters not explicitly forbidden are permitted. This means that if new characters are added in future Unicode versions, they can immediately be used in XML 1.1 documents.

If all of this doesn’t mean anything to you, then you probably don’t need to be too concerned about XML 1.1. Personally, I still type in version="1.0" and haven’t needed to change that yet. If you want to understand more about the intricacies of Unicode and XML 1.1, check out the complete specification at http://www.w3.org/TR/xml11.

Tip

All the tools and parsers used throughout this book will work with XML 1.0 and 1.1 documents.

XML Transformations

One of the cooler things about XML is the ability to transform it into something else. With the wealth of web-capable devices these days (computers, personal organizers, phones, DVRs, etc.), you never know what flavor of markup you need to deliver. Sometimes HTML works, sometimes XHTML (the XML flavor of HTML) is required, sometimes the Wireless Markup Language (WML) is supported; and sometimes you need something else entirely. In all of these cases, though, the basic data being displayed is the same; it’s just the formatting and presentation that changes. A great technique is to store the data in an XML document, and then transform that XML into various formats for display.

As useful as XML transformations can be, though, they are not simple to implement. In fact, rather than trying to specify the transformation of XML in the original XML 1.0 specification, the W3C has put out three separate recommendations to define how XML transformations work: XSL, XSLT, and XPath.

Because these three specifications are tied together tightly and are almost always used in concert, there is rarely a clear distinction between them. This can often make for a discussion that is easy to understand, but not necessarily technically correct. In other words, the term XSLT, which refers specifically to XSL Transformations, is often applied to both XSL and XPath. In the same fashion, XSL is often used as a grouping term for all three technologies. In this section, I distinguish among the three recommendations, and remain true to the letter of the specifications outlining these technologies. However, in the interest of clarity, I use XSL and XSLT interchangeably to refer to the complete transformation process throughout the rest of the book. Although this may not follow the letter of these specifications, it certainly follows their spirit, as well as avoiding lengthy definitions of simple concepts when you already understand what I mean.

XSL

XSL is the Extensible Stylesheet Language. It is defined as a language for expressing stylesheets. This broad definition is broken down into two parts:

XSL is a language for transforming XML documents.
XSL is an XML vocabulary for specifying the formatting of XML documents.

The definitions are similar, but one deals with moving from one XML document form to another, while the other focuses on the actual presentation of content within each document. Perhaps a clearer definition would be to say that XSL handles the specification of how to transform a document from format A to format B. The components of the language handle the processing and identification of the constructs used to do this.

XSL and trees

The most important concept to understand in XSL is that all data within XSL processing stages is in tree structures (see Figure 1-1). In fact, the rules you define using XSL are themselves held in a tree structure. This allows simple processing of the hierarchical structure of XML documents. Templates are used to match the root element of the XML document being processed. Then “leaf” rules are applied to “leaf” elements, filtering down to the most nested elements. At any point in this progression, elements can be processed, styled, ignored, copied, or have a variety of other things done to them.

Figure 1-1. Tree operations within XSL

A nice advantage of this tree structure is that it allows the grouping of XML documents to be maintained. If element A contains elements B and C, and element A is moved or copied, the elements contained within it receive the same treatment.

This makes the handling of large data sections that need to receive the same treatment fast and easy to notate concisely in the XSL stylesheet. You will see more about how this tree is constructed when I talk specifically about XSLT in the next section.

Formatting objects

The XSL specification is almost entirely concerned with defining formatting objects. A formatting object is based on a large model, not surprisingly called the formatting model. This model is all about a set of objects that are fed as input into a formatter. The formatter applies the objects to the document, and what results is a new document that consists of all or part of the data from the original XML document in a format specific to the objects the formatter used. Because this is such a vague, shadowy concept, the XSL specification attempts to define a concrete model to which these objects should conform. In other words, a large set of properties and vocabulary make up the set of features that formatting objects can use. These include the types of areas that may be visualized by the objects; the properties of lines, fonts, graphics, and other visual objects; inline and block formatting objects; and a wealth of other syntactical constructs.

Formatting objects are used heavily when converting textual XML data into binary formats such as PDF files, images, or document formats such as Microsoft Word. For transforming XML data to another textual format, these objects are seldom used explicitly. Although an underlying part of the stylesheet logic, formatting objects are rarely invoked directly, since the resulting textual data often conforms to another predefined markup language such as HTML. Because most enterprise applications today are based at least in part on web architecture and use a browser as a client, I spend the most time looking at transformations to HTML and XHTML. While formatting objects are covered only lightly, the topic is broad enough to merit its own coverage in a separate book. For further information, consult the XSL specification at http://www.w3.org/TR/xsl.

XSLT

The second component of XML transformations is XSL Transformations. XSLT is the language that specifies the conversion of a document from one format to another (where XSL defined the means of that specification). The syntax used within XSLT is generally concerned with textual transformations that do not result in binary data output. For example, XSLT is instrumental in generating HTML or WML from an XML document. In fact, the XSLT specification outlines the syntax of an XSL stylesheet more explicitly than the XSL specification itself!

Just as in the case of XSL, an XSLT stylesheet is always well-formed, valid XML. A DTD is defined for XSL and XSLT that delineates the allowed constructs. For this reason, you should only have to learn new syntax to use XSLT, and not new structural rules (if you know how XML is structured, you know how XSLT is structured). Just as in XSL, XSLT is based on a hierarchical tree structure of data, where nested elements are leaves, or children, of their parents. XSLT provides a mechanism for matching patterns within the original XML document, and applying formatting to that data. This results in anything from outputting XML data without the unwanted element names to inserting the data into a complex HTML table and displaying it to the user with highlighting and coloring. XSLT also provides syntax for many common operators, such as conditionals, copying of document tree fragments, advanced pattern matching, and the ability to access elements within the input XML data in an absolute and relative path structure. All these constructs are designed to ease the process of transforming an XML document into a new format.
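If you’d like a feel for how a transformation is driven from Java, the following is a minimal sketch using the javax.xml.transform API that ships with the JDK. The tiny stylesheet here is made up for this illustration (it is not one of the book’s examples), as are the un-namespaced channel document, the class TransformDemo, and its transform method.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformDemo {

    // Hypothetical stylesheet: match the document root and wrap the
    // channel title in an h1 element.
    public static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<h1><xsl:value-of select=\"channel/title\"/></h1>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the input XML and returns the result text.
    public static String transform(String xml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Simplified input without namespaces, to keep the paths short
        String xml = "<channel><title>My Journal</title></channel>";
        System.out.println(transform(xml, STYLESHEET));
    }
}
```

The stylesheet is itself an XML document, which is why it can be handed to the processor as just another source.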

XPath

As the final piece of the XML transformations puzzle, XPath provides a mechanism for referring to the wide variety of element and attribute names and values in an XML document. Many XML specifications now use XPath, but this discussion is concerned primarily with its use in XSLT. With the complex structure that an XML document can have, locating one specific element or set of elements can be difficult. It is made more difficult because access to a set of constraints that outlines the document’s structure cannot be assumed; documents that are not validated must be able to be transformed just as valid documents can. To accomplish this addressing of elements, XPath defines syntax in line with the tree structure of XML, and the XSLT processes and constructs that use it.

Referencing any element or attribute within an XML document is most easily accomplished by specifying the path to the element relative to the current element being processed. In other words, if element B is the current element and element C and element D are nested within it, a relative path most easily locates them. This is similar to the relative paths used in operating system directory structures. At the same time, XPath also defines addressing for elements relative to the root of a document. This covers the common case of needing to reference an element not within the current element’s scope; in other words, an element that is not nested within the element being processed. Finally, XPath defines syntax for actual pattern matching: find an element whose parent is element E and that has a sibling element F. This fills in the gaps left between the absolute and relative paths. In all these expressions, attributes can be used as well, with similar matching abilities:

<!-- Match the element named link underneath the current element -->
<xsl:value-of select="link" />

<!-- Match the element named title nested within the channel element -->
<xsl:value-of select="channel/title" />

<!-- Match the description element using an absolute path -->
<xsl:value-of select="/rdf:RDF/channel/description" />

<!-- Match the resource attribute of the current element -->
<xsl:value-of select="@rdf:resource" />

<!-- Match the resource attribute of the errorReportsTo element -->
<xsl:value-of select="/rdf:RDF/channel/admin:errorReportsTo/@rdf:resource" />

Because the input document is often not fixed, an XPath expression can match nothing in the input, a single element or attribute, or multiple elements and attributes. This flexibility makes XPath very useful; it also introduces some additional terms. The result of evaluating an XPath expression can be a node set. This name is in line with the idea of a hierarchical structure, which is dealt with in terms of leaves and nodes. The resultant node set can be empty, have a single member, or have many members. It can be transformed, copied, ignored, or have any other legal operation performed on it. Instead of a node set, evaluating an XPath expression could result in a Boolean value, a numerical value, or a string value.
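To make these result types concrete, here is a minimal sketch that evaluates a few expressions from Java with the standard javax.xml.xpath API. The feed is a cut-down, hypothetical stand-in for Example 1-1, and the class name, the item titles, and the rss prefix binding are assumptions made for illustration. Because the feed’s elements live in the RSS default namespace, and XPath 1.0 treats an unprefixed name as belonging to no namespace, the sketch binds the rss prefix just as Example 1-2 does.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathResultTypes {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Hypothetical, cut-down stand-in for the feed in Example 1-1.
    static final String FEED =
          "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns='" + RSS_NS + "'>"
        + "<channel rdf:about='http://www.neilgaiman.com/journal/journal.asp'>"
        + "<title>Neil Gaiman's Journal</title>"
        + "</channel>"
        + "<item><title>First entry</title></item>"
        + "<item><title>Second entry</title></item>"
        + "</rdf:RDF>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for prefixed XPath names to match
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    static XPath newXPath() {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Map the prefixes used in the expressions to namespace URIs.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                if ("rdf".equals(prefix)) return RDF_NS;
                if ("rss".equals(prefix)) return RSS_NS;
                return XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) { return null; }
            @Override public Iterator<String> getPrefixes(String uri) { return null; }
        });
        return xpath;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(FEED);
        XPath xpath = newXPath();

        // Absolute paths: one node set with two members, one that is empty.
        NodeList items = (NodeList) xpath.evaluate(
            "/rdf:RDF/rss:item", doc, XPathConstants.NODESET);
        NodeList images = (NodeList) xpath.evaluate(
            "/rdf:RDF/rss:image", doc, XPathConstants.NODESET);

        // Relative paths are evaluated against a context node, here the channel.
        Node channel = (Node) xpath.evaluate(
            "/rdf:RDF/rss:channel", doc, XPathConstants.NODE);
        String title = xpath.evaluate("rss:title", channel);
        String about = xpath.evaluate("@rdf:about", channel);

        // Number and Boolean results.
        Double count = (Double) xpath.evaluate(
            "count(/rdf:RDF/rss:item)", doc, XPathConstants.NUMBER);
        Boolean hasChannel = (Boolean) xpath.evaluate(
            "boolean(/rdf:RDF/rss:channel)", doc, XPathConstants.BOOLEAN);

        System.out.println(items.getLength() + " items, " + images.getLength() + " images");
        System.out.println(title + " (" + about + ")");
        System.out.println(count + " " + hasChannel);
    }
}
```

Note that the expression matching no image element yields an empty node set rather than an error, which is what lets a stylesheet cope with input that is not fixed.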

In addition to expressions that select node sets, XPath defines several functions that operate on node sets, like count(), which returns the number of nodes in a set, and not(), which takes a Boolean argument (a node set passed to it is first converted to true if it is nonempty and false if it is empty). All of these expressions and functions are part of the XPath specification and XPath implementations; however, the term XPath is also often used to signify any expression that conforms to the specification itself. As with XSL and XSLT, this makes it easier to talk about XSL and XPath, though it is not always technically correct.
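As a rough illustration of these functions, the following sketch evaluates count() and not(), along with the substring() and concat() calls that Example 1-2 uses to format dates. The sample document, the class name, and the eval helper are hypothetical, and namespaces are left out to keep the expressions short. One detail worth noticing: XPath string positions start at 1, so substring($datetime, 0, 5) covers positions 0 through 4 and therefore returns only the four characters of the year.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathFunctions {
    // Hypothetical sample; the date value is the dc:date from Example 1-1.
    static final String XML =
          "<channel><date>2005-04-30T01:57:38Z</date>"
        + "<item/><item/></channel>";

    // Evaluates one expression against the sample document.
    static Object eval(String expression, QName returnType) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc, returnType);
    }

    public static void main(String[] args) throws Exception {
        // count() takes a node set and returns a number.
        System.out.println(eval("count(/channel/item)", XPathConstants.NUMBER));  // 2.0

        // not() takes a Boolean; the empty node set converts to false first.
        System.out.println(eval("not(/channel/image)", XPathConstants.BOOLEAN));  // true

        // The date formatting from Example 1-2, written as one expression.
        String date = (String) eval(
            "concat(substring(/channel/date, 9, 2), '/',"
          + "       substring(/channel/date, 6, 2), '/',"
          + "       substring(/channel/date, 0, 5))",
            XPathConstants.STRING);
        System.out.println(date);                                                 // 30/04/2005

        // The time portion starts at position 12.
        System.out.println(eval("substring(/channel/date, 12, 5)",
            XPathConstants.STRING));                                              // 01:57
    }
}
```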

With all that in mind, you’re at least somewhat prepared to take a look at a simple XSL stylesheet, shown in Example 1-2.

Example 1-2. XSL stylesheet for Example 1-1

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
                xmlns:rss="http://purl.org/rss/1.0/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

<xsl:template match="/rdf:RDF">
<p>
 <a><xsl:attribute name="href">
     <xsl:value-of select="rss:channel/rss:link"/>
    </xsl:attribute>
    <xsl:value-of select="rss:channel/rss:title"/></a>
</p>
<p>

<!-- Make the date presentable -->
 <xsl:variable name="datetime" select="rss:channel/dc:date"/>
 <xsl:variable name="day" select="substring($datetime, 9, 2)"/>
 <xsl:variable name="month" select="substring($datetime, 6, 2)"/>
 <xsl:variable name="year" select="substring($datetime, 0, 5)"/>
 <xsl:value-of select="concat($day, '/', $month, '/', $year)"/> - 
 <xsl:value-of select="substring($datetime, 12, 5)"/>
</p>

<dl>
<xsl:for-each select="rss:item">
    <dt>
        <a><xsl:attribute name="href">
             <xsl:value-of select="rss:link"/>
           </xsl:attribute>
        <xsl:value-of select="rss:title"/></a>
    </dt>
    <dd>
        <xsl:value-of select="rss:description"
             disable-output-escaping="yes" />
        <!-- Format the publish date -->
        (<xsl:variable name="pubdate" select="dc:date"/>
        <xsl:variable name="pubday" select="substring($pubdate, 9, 2)"/>
        <xsl:variable name="pubmonth" select="substring($pubdate, 6, 2)"/>
        <xsl:variable name="pubyear" select="substring($pubdate, 0, 5)"/>
        <xsl:value-of select="concat($pubday, '/', $pubmonth, '/', $pubyear)"/> - 
        <xsl:value-of select="substring($pubdate, 12, 5)"/>)
    </dd>
</xsl:for-each>
</dl>

<p>
 <xsl:value-of select="rss:channel/dc:rights"/>
</p>
</xsl:template>

</xsl:stylesheet>

Template matching

The basis of all XSL work is template matching. For any element on which you want some sort of output to occur, you generally provide a template that matches the element. You define a template with the template element, and provide an XPath pattern identifying the element to match in its match attribute:

<xsl:template match="/rdf:RDF">
<p>
 <a><xsl:attribute name="href">
     <xsl:value-of select="rss:channel/rss:link"/>
    </xsl:attribute>
    <xsl:value-of select="rss:channel/rss:title"/></a>
</p>

  <!-- etc... -->
</xsl:template>

Here, the RDF element (in the rdf-associated namespace) is being matched; the leading / is an XPath construct that anchors the match at the root of the document. When an XSL processor encounters the RDF element, the instructions within this template are carried out. In the example, several HTML formatting tags are output (the p and a tags). Be sure to distinguish your XSL elements from other elements (such as HTML elements) with proper use of namespaces.

You can use the value-of construct to obtain the value of an element, and provide the element name to match through the select attribute. In the example, the character data within the title element is extracted and used as the text of a link, and the link element supplies the target of that link (its href attribute).

On the other hand, when you want the templates associated with an element’s children to be applied, use apply-templates. Be sure to do this, or nested elements can be ignored! You can specify the elements to apply templates to using the select attribute; a value of * for that attribute selects all child elements, so any matching templates are applied to each of them.
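The following sketch shows apply-templates handing nested elements off to their own template. It runs a small stylesheet through the standard javax.xml.transform API; the feed and item element names, the class name, and the transform helper are hypothetical and much simpler than Example 1-2.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltApplyTemplates {
    // Hypothetical stylesheet: the first template delegates each item
    // to the second template through apply-templates.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/feed'>"
        +   "<ul><xsl:apply-templates select='item'/></ul>"
        + "</xsl:template>"
        + "<xsl:template match='item'>"
        +   "<li><xsl:value-of select='title'/></li>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical input document.
    static final String INPUT =
          "<feed><title>Journal</title>"
        + "<item><title>One</title></item>"
        + "<item><title>Two</title></item></feed>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform(String xslt, String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Each item becomes one li element inside the ul.
        System.out.println(transform(STYLESHEET, INPUT));
    }
}
```

Removing the apply-templates instruction from the first template leaves the item template unused, so the list comes out empty; that is the "ignored nested elements" problem in miniature.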

Looping

You’ll also often find a need for looping in XSL:

<xsl:for-each select="rss:item">
    <dt>
        <a><xsl:attribute name="href">
            <xsl:value-of select="rss:link"/></xsl:attribute>
           <xsl:value-of select="rss:title"/></a>
    </dt>
    <dd>
        <xsl:value-of select="rss:description"
             disable-output-escaping="yes" />
        <!-- Format the publish date -->
        (<xsl:variable name="pubdate" select="dc:date"/>
        <xsl:variable name="pubday" select="substring($pubdate, 9, 2)"/>
        <xsl:variable name="pubmonth" select="substring($pubdate, 6, 2)"/>
        <xsl:variable name="pubyear" select="substring($pubdate, 0, 5)"/>
        <xsl:value-of select="concat($pubday, '/', $pubmonth, '/', $pubyear)"/> - 
        <xsl:value-of select="substring($pubdate, 12, 5)"/>)
    </dd>
</xsl:for-each>

Here, I’m looping through each element named item using the for-each construct. In Java, this would be:

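// items is assumed to be a java.util.Collection holding the item elements
for (Iterator i = items.iterator(); i.hasNext(); ) {
    Object item = i.next();  // advance; without this call the loop never ends
    // take action on each item
}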

Within the loop, the “current” element becomes the next item element encountered. For each item, I output the description (the entry text) using the value-of construct. Take particular note of the disable-output-escaping attribute. In the XML, the description element has HTML content, which makes liberal use of entity references:

When the shoot was done, my daughter Holly, who had been doing her 
homework in the room next door, and occasionally coming out to laugh 
at me, helped use up the last few pictures on the roll. She looks like 
she's having fun. I think I look a little dazed.&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://www.neilgaiman.com/journal/Neil_8313036.jpg" 
/&gt;&lt;br /&gt;&lt;br /&gt;This is the one we're going to be using 
on the book jacket of ANANSI BOYS.

Normally, value-of outputs text just as it is in the XML document being processed. The result would be that this escaped HTML would stay escaped. The output document would end up looking like Figure 1-2.

Figure 1-2. With output escaping on, HTML content within XML elements often won’t look correct

To ensure that your output is not escaped, set disable-output-escaping to yes.

Tip

Be sure you think this through. I used to get confused, thinking that I wanted to set this attribute to no so that escaping would not happen. However, a value of no results in escaping being enabled (not being disabled). Make sure you get this straight, or you’ll have some odd results.

Setting this attribute to yes and rerunning the transform results in the output shown in Figure 1-3.

Figure 1-3. With escaping turned off, output shows up as HTML, which is almost certainly the desired result
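If you want to see the difference without a browser, a small sketch like the one below runs the same value-of instruction with and without disable-output-escaping. The description input, the class name, and the helper methods are hypothetical. It also assumes the XSLT processor behind javax.xml.transform honors the attribute when serializing to a stream, which the XSLT specification does not strictly require of every processor.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class OutputEscapingDemo {
    // Hypothetical input: HTML stored as escaped text, as in the feed.
    static final String INPUT =
        "<description>I think I look a little dazed.&lt;br /&gt;</description>";

    // Builds a one-template stylesheet, optionally turning escaping off.
    static String stylesheet(boolean disableEscaping) {
        return "<xsl:stylesheet version='1.0'"
             + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
             + "<xsl:template match='/description'>"
             +   "<dd><xsl:value-of select='.'"
             +   (disableEscaping ? " disable-output-escaping='yes'" : "")
             +   "/></dd>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform(String xslt, String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Default behavior: the markup stays escaped in the output.
        System.out.println(transform(stylesheet(false), INPUT));
        // Escaping disabled: the br element appears as real markup.
        System.out.println(transform(stylesheet(true), INPUT));
    }
}
```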

Performing a transform

Before leaving XSL (at least for now), I want to show you how to easily perform transformations from the command line. This is a useful tool for quick-and-dirty tests; in fact, it’s how I generated the screenshots used in this chapter.

Download Xalan-J from the Xalan web site, http://xml.apache.org/xalan-j. Expand the archive (on my Windows laptop, I use c:/java/xalan-j_2_6_0).

Then add xalan.jar, xercesImpl.jar, and xml-apis.jar to your classpath. Finally, run the following command:

java org.apache.xalan.xslt.Process -IN [XML filename]
                                   -XSL [XSL stylesheet]
                                   -OUT [output filename]

For example, to generate the HTML output for Neil Gaiman’s feed, I used the tool like this:

> java org.apache.xalan.xslt.Process -IN gaiman-blogger_rss.xml 
                                     -XSL rdf.xsl -OUT test.html

You’ll get a file (test.html in this case) in the directory in which you run the command. Use this tool often; it will really help you figure out how XSL works, and what effect small changes have on output.
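If you would rather drive the same kind of transformation from code, a minimal sketch using the standard javax.xml.transform API looks like the following. The class name and the argument order are assumptions for illustration; this is not how the Xalan command-line tool itself is implemented. It relies only on some JAXP implementation being available, such as Xalan-J on the classpath or the one bundled with the JDK.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FileTransform {
    // Applies the stylesheet in xsl to the document in xml, writing to out.
    static void transform(File xml, File xsl, File out) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(xsl));
        // Open the output stream ourselves so it is always closed.
        try (OutputStream stream = new FileOutputStream(out)) {
            transformer.transform(new StreamSource(xml), new StreamResult(stream));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: java FileTransform <xml> <xsl> <out>");
            return;
        }
        transform(new File(args[0]), new File(args[1]), new File(args[2]));
    }
}
```

With the filenames from the example above, this would be run as java FileTransform gaiman-blogger_rss.xml rdf.xsl test.html.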

And More...

Lest I mislead you into thinking that’s all that there is to XML, I want to make sure that you realize there are a multitude of other XML-related technologies. I can’t possibly get into them all in this chapter, or even in this book. You should take a quick glance at things like Cascading Style Sheets (CSS) and XHTML if you are working on web design. Document authors will want to find out more about XLink and XPointer. XQuery will be of interest to database programmers. In other words, there’s something XML-related for pretty much every technology space right now. Take a look at the W3C XML activity page at http://www.w3.org/XML and see what looks interesting.


Java and XML, 3rd Edition by Brett McLaughlin, Justin Edelson