Java and XML
Java and XML, Third Edition By Brett McLaughlin, Justin Edelson
December 2006
Pages: 479

Table of Contents

Chapter 1: Introduction
In the next two chapters, I’m going to give you a crash course in XML and constraints. Since there is so much material available on XML and related specifications, I’d rather cruise through this material quickly and get on to Java. For those of you who are completely new to XML, you might want to have a few of the following books around as reference:
XML in a Nutshell, by Elliotte Rusty Harold and W. Scott Means
Learning XML, by Erik Ray
Learning XSLT, by Michael Fitzgerald
XSLT, by Doug Tidwell
These are all O’Reilly books, and I have them scattered about my own workspace. With that said, let’s dive in.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
XML 1.0
It all begins with the XML 1.0 Recommendation, which you can read in its entirety at http://www.w3.org/TR/REC-xml. Example 1-1 shows an XML document that conforms to this specification. I’ll use it to illustrate several important concepts.
Example 1-1. A typical XML document is long and verbose
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:dc="http://purl.org/dc/elements/1.1/" 
         xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" 
         xmlns:l="http://purl.org/rss/1.0/modules/link/" 
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <!--Generated by Blogger v5.0-->
  <channel rdf:about="http://www.neilgaiman.com/journal/journal.asp">
    <title>Neil Gaiman's Journal</title>
    <link>http://www.neilgaiman.com/journal/journal.asp</link>
    <description>Neil Gaiman's Journal</description>
    <dc:date>2005-04-30T01:57:38Z</dc:date>
    <dc:language>en-US</dc:language>
    <admin:generatorAgent rdf:resource="http://www.blogger.com/" />
    <admin:errorReportsTo rdf:resource="mailto:rss-errors@blogger.com" />
    <items>
      <rdf:Seq>
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/jetlag-morning.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/demon-days.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/more-from-mailbag.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/two-days.asp" />
        <rdf:li 
  rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp" />
      </rdf:Seq>
    </items>
  </channel>

  <!-- and so on... -->
</rdf:RDF>
For those of you who are curious, this is the RSS feed for Neil Gaiman’s blog (http://www.neilgaiman.com). It uses a lot of RSS syntax, which I’ll cover in Chapter 12 in detail.
A lot of this specification describes what is mostly intuitive. If you’ve done any HTML authoring, or SGML, you’re already familiar with the concept of elements (such as
XML 1.1
In February of 2004, the XML 1.1 specification was released by the World Wide Web Consortium (W3C; http://www.w3.org). If you don’t recall hearing much about XML 1.1, it’s no surprise; XML 1.1 was largely about Unicode conformance, and really didn’t affect XML as a whole that much, particularly for document authors and programmers not working with unusual character sets.
While XML was undergoing fairly minor maintenance updates, Unicode moved from Version 2.0 to 4.0. Since XML relies on Unicode for the characters allowed in XML element and attribute names, this had a ripple effect on document authors who wanted to use the new Unicode 4.0 characters in their documents. In XML 1.0, the specification had to explicitly permit characters to be in element and attribute names; as a result, new characters in later versions of Unicode were excluded for name usage by parsers. In XML 1.1—in an effort to avoid similar problems in the future—characters not explicitly forbidden are permitted. This means that if new characters are added in future Unicode versions, they can immediately be used in XML 1.1 documents.
If all of this doesn’t mean anything to you, then you probably don’t need to be too concerned about XML 1.1. Personally, I still type in version="1.0" and haven’t needed to change that yet. If you want to understand more about the intricacies of Unicode and XML 1.1, check out the complete specification at http://www.w3.org/TR/xml11.
All the tools and parsers used throughout this book will work with XML 1.0 and 1.1 documents.
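As a quick check of that claim, here is a minimal sketch, assuming the JAXP parser bundled with the JDK. The class name XmlVersionCheck and the sample documents are made up for illustration; the code parses a document and reads back the version it declares.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper (not from the book): reports the XML version
// that a document declares in its XML declaration.
class XmlVersionCheck {

  static String declaredVersion(String xml) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    // getXmlVersion() is part of DOM Level 3
    return doc.getXmlVersion();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(declaredVersion("<?xml version=\"1.1\"?><journal/>"));
    System.out.println(declaredVersion("<?xml version=\"1.0\"?><journal/>"));
  }
}
```

If your parser handles both versions, neither call should throw.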
XML Transformations
One of the cooler things about XML is the ability to transform it into something else. With the wealth of web-capable devices these days (computers, personal organizers, phones, DVRs, etc.), you never know what flavor of markup you need to deliver. Sometimes HTML works, sometimes XHTML (the XML flavor of HTML) is required, sometimes the Wireless Markup Language (WML) is supported; and sometimes you need something else entirely. In all of these cases, though, the basic data being displayed is the same; it’s just the formatting and presentation that changes. A great technique is to store the data in an XML document, and then transform that XML into various formats for display.
As useful as XML transformations can be, though, they are not simple to implement. In fact, rather than trying to specify the transformation of XML in the original XML 1.0 specification, the W3C has put out three separate recommendations to define how XML transformations work.
Because these three specifications are tied together tightly and are almost always used in concert, there is rarely a clear distinction between them. This can often make for a discussion that is easy to understand, but not necessarily technically correct. In other words, the term XSLT, which refers specifically to extensible stylesheet transformations, is often applied to both XSL and XPath. In the same fashion, XSL is often used as a grouping term for all three technologies. In this section, I distinguish among the three recommendations, and remain true to the letter of the specifications outlining these technologies. However, in the interest of clarity, I use XSL and XSLT interchangeably to refer to the complete transformation process throughout the rest of the book. Although this may not follow the letter of these specifications, it certainly follows their spirit, as well as avoiding lengthy definitions of simple concepts when you already understand what I mean.
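To make the idea of transforming one set of data into a display format concrete, here is a rough sketch that uses the javax.xml.transform API shipped with the JDK. The stylesheet, the channel/title input, and the TitleToHtml class are all invented for this illustration; they are not taken from the book's examples.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: turns a channel title into an HTML-style heading.
class TitleToHtml {

  // A tiny stylesheet; the XPath expression /channel/title selects the data
  static final String STYLESHEET =
      "<xsl:stylesheet version=\"1.0\""
    + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
    + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
    + "<xsl:template match=\"/\">"
    + "<h1><xsl:value-of select=\"/channel/title\"/></h1>"
    + "</xsl:template>"
    + "</xsl:stylesheet>";

  static String transform(String xml) throws Exception {
    Transformer transformer = TransformerFactory.newInstance()
        .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
    StringWriter out = new StringWriter();
    transformer.transform(new StreamSource(new StringReader(xml)),
                          new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(
        transform("<channel><title>Neil Gaiman's Journal</title></channel>"));
  }
}
```

The same input could be run through a different stylesheet to produce WML or any other markup, which is the point of keeping the data in XML.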
XSL is the Extensible Stylesheet Language. It is defined as a language for expressing stylesheets. This broad definition is broken down into two parts:
And More...
Lest I mislead you into thinking that’s all that there is to XML, I want to make sure that you realize there are a multitude of other XML-related technologies. I can’t possibly get into them all in this chapter, or even in this book. You should take a quick glance at things like Cascading Style Sheets (CSS) and XHTML if you are working on web design. Document authors will want to find out more about XLink and XPointer. XQuery will be of interest to database programmers. In other words, there’s an XML technology for pretty much every technology space right now. Take a look at the W3C XML activity page at http://www.w3.org/XML and see what looks interesting.
Chapter 2: Constraints
It’s rare that you’ll be able to author XML without worrying about anyone else modifying your document, or anyone having to interpret the meaning of the document. The majority of the time, someone (or something) will have to figure out what your tags mean, what data is allowed within those tags, and how your document is structured. This is where constraint models come into play in the XML world. A constraint model defines the structure of your document and, to some degree, the data allowed within that structure.
In fact, if you take XML as being a data representation, you really can’t divorce a document (often called an instance) from its constraints (the schema). The instance document contains the data, and the schema gives form to that data. You can’t have one without the other; at least, not without introducing tremendous room for error. An instance document without a schema must be interpreted by the recipient; and do you really want him deciding what your elements and attributes meant?
There’s an argument that essentially goes like this: “Good XML should be structured so that it’s self-documenting.” That’s a good goal, but practically impossible. As a programmer, I often think my code is well documented and easily understood; but I’m assuming a certain level of expertise, and a certain approach to coding. Change just a few bits here and there, and someone else might reasonably interpret my “well-documented” code (or XML) completely differently than I might. Taking the time to write a schema solves this problem much more definitively.
There are three basic models for constraints in use today:
DTDs
Introduced as part of the XML 1.0 specification, DTDs are the oldest constraint model around in the XML world. They’re simple to use, but this simplicity comes at a price: DTDs are inflexible and offer little in the way of data type validation.
XML Schema (XSD)
XML Schema is the W3C’s anointed successor to DTDs. XML Schemas are far more flexible than DTDs, and offer an almost dizzying array of support for various data types. However, just as DTDs were simple and limited, XML Schemas are flexible, complex, and (some would argue) bloated. It takes a lot of work to write a good schema, even for 50- or 100-line XML documents. For this reason, there’s been a lot of dissatisfaction with XML Schema, even though it is widely used.
DTDs
A DTD defines how data is formatted. It must define each allowed element in an XML document, the allowed attributes, and—when appropriate—the acceptable attribute values for each element; it also indicates the nesting and occurrences of each element, and any external entities. DTDs can specify many other things about an XML document, but these basics are what I’ll focus on here.
This chapter is by no means an extensive treatment of DTDs, XML Schema, or RELAX NG. For more detail on all of these schema types, check out XML in a Nutshell by Elliotte Rusty Harold and W. Scott Means (O’Reilly), and RELAX NG by Eric van der Vlist (O’Reilly), both exhaustive works on XML and RELAX NG.
There’s remarkably little to a DTD’s semantics, although you will have to use a totally different syntax for notation than you do in XML (an annoyance corrected in both XML Schema and RELAX NG).

Elements

The bulk of the DTD is composed of ELEMENT definitions (covered in this section) and ATTLIST (attribute list) definitions (covered in the next section). An element definition begins with the ELEMENT keyword, following the standard <! opening of a DTD tag, and then the name of the element. Following that name is the content model of the element. The content model is generally within parentheses and specifies what content can be included within the element. Take the item element, from the RSS 0.91 DTD (http://my.netscape.com/publish/formats/rss-0.91.dtd) as an example:
<!ELEMENT item (title | link | description)*>
This says that for any item element, there may be a title element, a link element, or a description element nested within that item. The “or” relationship is indicated by the pipe (|) symbol; the OR applies to all elements within a group, indicated by the parentheses. In other words, each occurrence of the grouping (title | link | description) matches one and only one of title, link, or description. The asterisk after the grouping indicates a recurrence: the grouping may repeat any number of times, including zero. Table 2-1 lists the complete set of DTD recurrence modifiers.
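To see this content model enforced, here is a small sketch using JAXP's DTD validation in the JDK. The item declaration comes from the text above; the #PCDATA declarations for title, link, and description, the ItemDtdCheck class, and the sample documents are assumptions added only so the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: validates an item element against an internal DTD subset.
class ItemDtdCheck {

  // The child declarations are assumed for this sketch
  static final String DOCTYPE =
      "<!DOCTYPE item [\n"
    + "<!ELEMENT item (title | link | description)*>\n"
    + "<!ELEMENT title (#PCDATA)>\n"
    + "<!ELEMENT link (#PCDATA)>\n"
    + "<!ELEMENT description (#PCDATA)>\n"
    + "]>\n";

  static boolean isValid(String itemXml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(true); // turn on DTD validation
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Validity errors are only reported, not thrown, unless a handler rethrows them
    builder.setErrorHandler(new DefaultHandler() {
      @Override
      public void error(SAXParseException e) throws SAXException {
        throw e;
      }
    });
    try {
      builder.parse(new InputSource(new StringReader(DOCTYPE + itemXml)));
      return true;
    } catch (SAXException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    // Children may repeat and appear in any order, or not at all
    System.out.println(isValid("<item><link>l</link><title>t</title><link>m</link></item>"));
    System.out.println(isValid("<item/>"));
    // author is not part of the content model
    System.out.println(isValid("<item><author>a</author></item>"));
  }
}
```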
XML Schema
XML Schema seeks to improve upon DTDs by adding more typing and quite a few more constructs than DTDs, as well as using XML as the constraint representation format. I’m going to spend relatively little time here talking about schemas, because they are a behind-the-scenes detail for Java and XML. In the chapters where you’ll be working with schemas, I’ll address any specific points you need to be aware of. However, the specification for XML Schema is so enormous that it would take up an entire book of explanation on its own. As a matter of fact, XML Schema by Eric van der Vlist (O’Reilly) is just that: an entire book on XML Schema.
Before getting into the actual schema constructs, take a look at a typical XML Schema root element:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
    xmlns:dw="http://www.ibm.com/developerWorks/" 
    elementFormDefault="unqualified" 
    attributeFormDefault="unqualified" version="4.0">
There’s quite a bit going on here, including two different namespace declarations. First, the XML Schema namespace itself is attached to the xsd prefix, allowing separation of XML Schema constructs from the elements and attributes being constrained. Next, the dw namespace is defined; this particular example is from the IBM DeveloperWorks XML article template, and dw is used for DeveloperWorks-specific constructs.
Then, the values of attributeFormDefault and elementFormDefault are set to "unqualified". This allows XML instance documents to omit namespace declarations on elements and attributes. Qualifications are a fairly tricky idea, largely because attributes in XML do not fall into the default namespace; they must explicitly be assigned to a namespace. For a lot more on qualification, check out the relevant portion of the XML Schema specification at http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-schema.
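For a feel of how a schema with this kind of root element is applied from Java, here is a minimal sketch using the javax.xml.validation classes in the JDK. The entry schema, the sample documents, and the SchemaCheck class are invented for illustration and are not the DeveloperWorks schema discussed above.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper: validates a document against a small, made-up schema.
class SchemaCheck {

  static final String XSD =
      "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\""
    + " elementFormDefault=\"unqualified\""
    + " attributeFormDefault=\"unqualified\" version=\"4.0\">"
    + "<xsd:element name=\"entry\">"
    + "<xsd:complexType><xsd:sequence>"
    + "<xsd:element name=\"firstName\" type=\"xsd:string\"/>"
    + "<xsd:element name=\"age\" type=\"xsd:int\"/>"
    + "</xsd:sequence></xsd:complexType>"
    + "</xsd:element>"
    + "</xsd:schema>";

  static boolean isValid(String xml) throws Exception {
    SchemaFactory factory =
        SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
    Validator validator = schema.newValidator();
    try {
      // With no error handler set, the validator throws on the first error
      validator.validate(new StreamSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(isValid("<entry><firstName>Ann</firstName><age>30</age></entry>"));
    // "thirty" is not a valid xsd:int, something a DTD could not check
    System.out.println(isValid("<entry><firstName>Ann</firstName><age>thirty</age></entry>"));
  }
}
```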
Finally, the version attribute is given a value of "4.0". This is used to indicate the version of this particular schema, not of the XML Schema specification being used. The namespace assigned to the
RELAX NG
RELAX NG is, in many senses, the rebel child in the constraint family. While DTDs and XML Schema are both W3C specifications (or at least part of a specification, in the case of DTDs), RELAX NG is not endorsed or “blessed” by the W3C. And, even though it has been developed underneath the OASIS umbrella (http://www.oasis-open.org/home/index.php), RELAX NG is still seen as almost a grassroots effort to compete with—or at least provide an alternative to—XML Schema. Whatever you think about the political standing of RELAX NG, though, any good XML programmer should have RELAX NG in her constraint toolkit.
RELAX NG, like XML Schema, is pure XML. You start out by nesting everything within a grammar element:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Content model for XML -->
</grammar>
This sets up the namespace for all the elements you use, which are of course all part of the RELAX NG syntax. The datatypeLibrary attribute tells the schema where to pull data types from when you type elements and attributes (data types are covered in the “Data types” section later). You don’t have to put this on your root element, but you’ll find that’s the best place for it; otherwise, you end up burying it somewhere in the middle of your schema, and that’s a maintenance pain.
As with XML Schema, you should always use the same URI for the namespace here (http://relaxng.org/ns/structure/1.0).
You’ll find that most of the RELAX NG constructs are pretty intuitive; I’ll run through the highlights.

Elements

You define elements using the element keyword, and nestings within an XML document are represented by nestings within the RELAX NG schema:
<element name="phonebook">
  <element name="entry">
    <element name="firstName">
      <text/>
    </element>
    <element name="lastName">
      <text/>
    </element>
    <!-- etc... -->   
  </element>
</element>
In fact, you should already be seeing one of the cooler features of RELAX NG: its structure closely mirrors the structure of the document it’s constraining.
Chapter 3: SAX
XML is fundamentally about data; programming with XML, then, has to be fundamentally about getting at that data. That process, called parsing, is the basic task of the APIs I’ll cover in the next several chapters. This chapter describes how an XML document is parsed, focusing on the events that occur within this process. These events are important, as they are all points where application-specific code can be inserted and data manipulation can occur.
I’m also going to introduce you to one of the two core XML APIs in Java: SAX, the Simple API for XML (http://www.saxproject.org). SAX is what makes insertion of this application-specific code into events possible. The interfaces provided in the SAX package are an important part of any programmer’s toolkit for handling XML. Even though the SAX classes are small and few in number, they provide a critical framework for Java and XML to operate within. Solid understanding of how they help in accessing XML data is critical to effectively leveraging XML in your Java programs.
For the impatient, the other of those two core APIs is DOM. Coverage of DOM begins in Chapter 5.
Setting Up SAX
I’m increasingly of the “learning is best done by doing” philosophy, so I’m not going to hit you with a bunch of concept and theory before getting to code. SAX is a simple API, so you only need to understand its basic model, and how to get the API on your machine; beyond that, code will be your best teacher.
SAX uses a callback model for interacting with your code; you may also have heard this model called event-based programming. Whatever you call it, it’s a bit of a departure for object-oriented developers, so give it some time if you’re new to this type of programming.
In short, the parsing process is going to hum along, tearing through an XML document. Every time it encounters a tag, or comment, or text, or any other piece of XML, it calls back into your code, signaling that an event has occurred. Your code then has an opportunity to act, based on the details of that event.
For example, if SAX encounters the opening tag of an element, it fires off a startElement event. It provides information about that event, such as the name of the element, its attributes, and so on, and then your code gets to respond. You, as a programmer, have to write code for each event that is important to you—from the start of a document to a comment to the end of an element. This process is summed up in Figure 3-1.
Figure 3-1: The parsing process is controlled by the parser and your code listens for events, responding as they occur
What’s different about this model is that your code is not active, in the sense that it doesn’t ever instruct the parser, “Hey, go and parse the next element.” It’s passive, in that it waits to be called, and then leaps into action. This takes a little getting used to, but you’ll be an old hand by the end of the chapter.
Swing and AWT programmers, as well as EJB experts, are familiar with this approach to programming.
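The callback flow just described can be sketched in a few lines. This is only an illustration: it assumes the JAXP SAXParserFactory that ships with the JDK as a convenient way to get a parser, and the EventLogger class and its event strings are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records each event the parser reports.
class EventLogger extends DefaultHandler {

  final List<String> events = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes atts) {
    events.add("start:" + qName);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    events.add("end:" + qName);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    String text = new String(ch, start, length).trim();
    if (!text.isEmpty()) {
      events.add("text:" + text);
    }
  }

  // The parser drives everything; this code never asks for "the next element"
  static List<String> parse(String xml) throws Exception {
    EventLogger logger = new EventLogger();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), logger);
    return logger.events;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(parse("<a><b>hi</b></a>"));
  }
}
```

Notice that the handler only reacts; the order of the recorded events is decided entirely by the parser as it moves through the document.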
Unsurprisingly, the SAX API is made up largely of interfaces that define these various callback methods. You would implement the ContentHandler
Parsing with SAX
Without spending any further time on the preliminaries, it’s time to code. As a sample to familiarize you with SAX, this chapter details the SAXTreeViewer class. This utility uses SAX to parse an XML document, and displays the document visually as a Swing JTree.
If you don’t know anything about Swing, don’t worry; I don’t focus on that, but just use it for visual purposes. The focus will remain on SAX, and how events within parsing can be used to perform customized action.
The first thing you need to do in any SAX-based application is get an instance of a class that implements the SAX org.xml.sax.XMLReader interface; remember, this is why you downloaded a SAX-compliant parser in the first place.
SAX provides the org.xml.sax.XMLReader interface for all SAX-compliant XML parsers to implement. For example, the Xerces SAX parser implementation, org.apache.xerces.parsers.SAXParser, implements the XMLReader interface. If you have access to the source of your parser, you should see the same interface implemented in your parser’s main SAX parser class. Each XML parser must have one class (and sometimes has more than one) that implements this interface, and that is the class you need to instantiate to allow for parsing XML:
// Instantiate a reader (this ties your code to the Xerces parser)
XMLReader reader =
  new org.apache.xerces.parsers.SAXParser();

// Parse the document identified by uri
reader.parse(uri);
If you’re new to SAX, you may be wondering why XMLReader isn’t called Parser. In fact, it was in SAX 1.0, and then so many changes were introduced that the interface had to be deprecated and replaced under a new name. As a result, you’ll call the parse() method on an XMLReader instance.
This approach ties you tightly to your parser vendor, though; you can use SAX’s org.xml.sax.helpers.XMLReaderFactory to get away from this:
XMLReader reader = XMLReaderFactory.createXMLReader();
Just set the org.xml.sax.driver system property, and you can get your vendor’s XMLReader implementation, without importing your vendor’s classes:
java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
    [MyClassName]
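Putting these pieces together, here is a hedged sketch of vendor-neutral parsing. It relies on the parser bundled with the JDK being picked up when org.xml.sax.driver is not set, and note that newer JDKs mark XMLReaderFactory as deprecated. The ElementCounter class and its sample input are invented for the example.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example: counts elements without importing any vendor classes.
class ElementCounter {

  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
  static int countElements(String xml) throws Exception {
    final int[] count = {0};
    // The implementation comes from org.xml.sax.driver or the platform default
    XMLReader reader = XMLReaderFactory.createXMLReader();
    reader.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName,
                               String qName, Attributes atts) {
        count[0]++;
      }
    });
    reader.parse(new InputSource(new StringReader(xml)));
    return count[0];
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countElements("<a><b/><b/></a>"));
  }
}
```

Swapping parsers then becomes a matter of changing the system property on the command line rather than editing and recompiling code.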
Content Handlers
To let an application do something useful with XML data, you must register handlers with the SAX parser. A handler is nothing more than a set of callbacks that SAX defines; a group, if you will, of related events to which you might want to attach code.
There are four core handler interfaces defined by SAX 2.0: org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, org.xml.sax.DTDHandler, and org.xml.sax.EntityResolver.
In this chapter, I will discuss ContentHandler and ErrorHandler. I’ll leave discussion of DTDHandler and EntityResolver for the next chapter; it is enough for now to understand that EntityResolver and DTDHandler work just like the other handlers, but just group different behaviors.
Your classes implement one or more of these handlers and fill in the callback methods with working code (or, if you desire, no code at all; this effectively ignores a certain type of event). You then register your handler implementations using setContentHandler(), setErrorHandler(), setDTDHandler(), and setEntityResolver(), all on the XMLReader interface (see Figure 3-4). The reader then invokes the callback methods on the appropriate handlers during parsing.
Figure 3-4: The handler classes are all passed into the XMLReader interface, and then used during parsing to trigger programmer-defined behaviors
For the SAXTreeViewer example, start by implementing the ContentHandler interface. ContentHandler, as the name implies, details events related to the content of an XML document: elements, attributes, character data, etc. Add the following class to the end of your SAXTreeViewer.java source listing:
class JTreeHandler implements ContentHandler {

  /** Tree Model to add nodes to */
  private DefaultTreeModel treeModel;

  /** Current node to add sub-nodes to */
  private DefaultMutableTreeNode current;

  public JTreeHandler(DefaultTreeModel treeModel, 
                     DefaultMutableTreeNode base) {
    this.treeModel = treeModel;
    this.current = base;
  }

  // ContentHandler callback implementations
}
Error Handlers
In addition to providing the ContentHandler interface for handling parsing events, SAX provides an ErrorHandler interface that can be implemented to treat various error conditions that may arise during parsing (see Figure 3-10).
Figure 3-10: ErrorHandler defines only three methods, but how you implement these methods can have a huge impact on the user experience
This interface works in the same manner as the content handler already constructed, but defines only three callback methods. Through these three methods, all error conditions are handled and reported by SAX parsers.
Each method receives information about the error or warning that has occurred through a SAXParseException. This object holds the line number where the trouble was encountered, the URI of the document being treated (which could be the parsed document or an external reference within that document), and normal exception details such as a message and a printable stack trace. In addition, each of these methods can throw a SAXException. This may seem a bit odd at first: an exception handler that throws an exception? Keep in mind that each handler receives a parsing exception. This might be a warning that should not cause the parsing process to stop or an error that needs to be resolved for parsing to continue; however, the callback may need to perform system I/O or another operation that can throw another exception, and the method needs to be able to send any problems resulting from these actions up the application chain.
As an example, consider an error handler that receives error notifications and writes those errors to an error log. This callback method needs to be able to either append to or create an error log on the local filesystem. If a warning occurs within the process of parsing an XML document, the warning would be reported to this method. The intent of the warning is to give information to the callback and then continue parsing the document. However, if the error handler cannot write to the logfile, it should notify the parser and application that all parsing should stop. This can be done by catching any I/O exceptions and rethrowing these to the calling application, thus causing any further document parsing to stop. This common scenario is why error handlers must be able to throw exceptions (see Example 3-3).
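The logging scenario above might look something like the following sketch. It is not the book's Example 3-3; the LoggingErrorHandler class, its message format, and the choice to stop on every error (but not on warnings) are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: writes problems to a log and stops parsing
// if the log itself cannot be written.
class LoggingErrorHandler implements ErrorHandler {

  private final Writer log;

  LoggingErrorHandler(Writer log) {
    this.log = log;
  }

  private void record(String level, SAXParseException e) throws SAXException {
    try {
      log.write(level + " at line " + e.getLineNumber()
          + ": " + e.getMessage() + "\n");
      log.flush();
    } catch (IOException io) {
      // A broken log is reported up the chain, which halts parsing
      throw new SAXException("Cannot write to error log", io);
    }
  }

  @Override
  public void warning(SAXParseException e) throws SAXException {
    record("WARNING", e); // parsing continues after a warning
  }

  @Override
  public void error(SAXParseException e) throws SAXException {
    record("ERROR", e);
    throw e;
  }

  @Override
  public void fatalError(SAXParseException e) throws SAXException {
    record("FATAL", e);
    throw e;
  }

  // Returns true if the document parsed without an error being thrown
  static boolean parses(String xml, Writer log) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setErrorHandler(new LoggingErrorHandler(log));
    try {
      reader.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    StringWriter log = new StringWriter();
    System.out.println(parses("<a><b></a>", log)); // mismatched tags
    System.out.print(log);
  }
}
```

In a real application the Writer would typically wrap a file on the local filesystem, which is exactly where the I/O failures described above come from.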
Chapter 4: Advanced SAX
What you’ve seen regarding SAX so far is essentially the simplest way to process and parse XML. And while SAX is indeed named the Simple API for XML, it offers programmers much more than basic parsing and content handling. There is an array of settings that affect parser behavior, as well as several additional handlers for edge-case scenarios; if you need to specify exactly how strings should be interned, or what behavior should occur when a DTD declares a notation, or even differentiate between CDATA sections and regular text sections, SAX provides. In fact, you can even modify and write out XML using SAX (along with a few additional packages); SAX is a full-featured API, and this chapter will give you the lowdown on features that go beyond simple parsing.
Additional content appearing in this section has been removed.
Properties and Features
I glossed over validation in the last chapter, and probably left you with a fair number of questions. When I cover JAXP in Chapter 7, you’ll see that you can use either a method (setValidating(  )) or a set of classes (javax.xml.validation) to handle validation; you might expect to call a similar method, setValidation(  ) or something like it, to initiate validation in SAX. But then, there’s also namespace awareness, dealt with quite a bit in Chapter 2 (and Chapter 3, with respect to Q names and local names); maybe setNamespaceAwareness(  )? But what about schema validation? And setting the location of a schema to validate on, if the document doesn’t specify one? There’s also low-level behavior, like telling the parser what to do with entities (parse them? don’t parse them?), how to handle strings, and a lot more. As you can imagine, dealing with each of these could cause real API bloat, adding 20 or 30 methods to SAX’s XMLReader class. And, even worse, each time a new setting was needed (perhaps for the next type of constraint model supported? How about setRelaxNGSchema(  )?), the SAX API would have to add a method or two, and release a new version. Clearly, this isn’t a very effective approach to API design.
If this isn’t clear to you, check out Head First Design Patterns, by Elisabeth and Eric Freeman (O’Reilly). In particular, read up on Chapter 1 (pages 8 and 9), which details why it’s critical to encapsulate what varies.
To address the ever-changing need to affect parser behavior, without causing constant API change, SAX 2 defines a standard mechanism for setting parser behavior: through the use of properties and features.
In SAX, a property is a setting that requires passing in some Object argument for the parser to use; for instance, certain types of handlers are set by specifying a URI and supplying the Object that implements that handler’s interface. A feature is a setting that is either on (true) or off (false). Two obvious examples come to mind: namespace awareness and validation.
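To make the distinction concrete, here is a minimal sketch of setting two features and one property on an XMLReader. It assumes the parser bundled with the JDK (obtained through JAXP) and uses the standard SAX 2 identifiers for namespaces, validation, and the declaration handler; the class name FeaturePropertyDemo is made up for illustration. A parser that doesn't recognize an identifier is expected to throw SAXNotRecognizedException, so real code would normally catch that.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; only the identifiers below come from SAX 2 itself.
class FeaturePropertyDemo {

  // Features and properties are named by URI; the URIs are never fetched.
  static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
  static final String VALIDATION = "http://xml.org/sax/features/validation";
  static final String DECL_HANDLER =
      "http://xml.org/sax/properties/declaration-handler";

  static XMLReader configure(DeclHandler declHandler) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();

    // Features are simple on/off switches.
    reader.setFeature(NAMESPACES, true);
    reader.setFeature(VALIDATION, false);

    // Properties take an Object, here a handler implementation.
    reader.setProperty(DECL_HANDLER, declHandler);
    return reader;
  }

  public static void main(String[] args) throws Exception {
    XMLReader reader = configure(new DefaultHandler2());
    System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
    System.out.println("validation: " + reader.getFeature(VALIDATION));
  }
}
```

The point of the design is visible in the method signatures: a new setting needs only a new URI, not a new method on XMLReader.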
SAX includes the methods needed for setting properties and features in the
Additional content appearing in this section has been removed.
Resolving Entities
You’ve already seen how to interact with content in the XML document you’re parsing (using ContentHandler), and how to deal with error conditions (ErrorHandler). Both of these are concerned specifically with the data in an XML document. What I haven’t talked about is the process by which the parser goes outside of the document and gets data. For example, consider a simple entity reference in an XML document:
<FM>
<P>Text placed in the public domain by Moby Lexical Tools, 1992.</P>
<P>SGML markup by Jon Bosak, 1992-1994.</P>
<P>XML version by Jon Bosak, 1996-1998.</P>
<P>&usage-terms;</P>
</FM>
Your schema then indicates to the parser how to resolve that entity:
<!ENTITY usage-terms  
    SYSTEM "http://www.newInstance.com/entities/usage-terms.xml">
At parse time, the usage-terms entity reference will be expanded (in this case, to “This work may be freely copied and distributed worldwide.”, as seen in Figure 4-1).
Figure 4-1: The usage-terms entity was resolved to a URI, which was then parsed and inserted into the document
However, there are several cases where you might not want this “default” behavior:
  • You don’t have network access, so you want the entity to resolve to a local copy of the referenced document (perhaps a version you’ve downloaded yourself).
  • You want to substitute your own content for the content specified in the schema.
You can short-circuit normal entity resolution using org.xml.sax.EntityResolver. This interface does exactly what it says: resolves entities. More important, it allows you to get involved in the entity resolution process. The interface defines only a single method, as shown in Figure 4-2.
Figure 4-2: There’s not much to the EntityResolver interface; just a single, albeit useful, method
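Since the figure isn't reproduced here, the following is a rough sketch of what an implementation of that single method, resolveEntity(  ), could look like. It handles the usage-terms system ID from the declaration above by returning inline text in place of the remote document; the class name LocalUsageTermsResolver and the constant name are my own inventions, and a real application would more likely return an InputSource for a local file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical resolver: substitutes local content for one known system ID.
class LocalUsageTermsResolver implements EntityResolver {

  static final String USAGE_TERMS =
      "http://www.newInstance.com/entities/usage-terms.xml";

  @Override
  public InputSource resolveEntity(String publicId, String systemId) {
    if (USAGE_TERMS.equals(systemId)) {
      // Supply the replacement content directly; no network access needed.
      InputSource source = new InputSource(new StringReader(
          "This work may be freely copied and distributed worldwide."));
      source.setSystemId(systemId);
      return source;
    }
    // Returning null asks the parser to fall back to its default resolution.
    return null;
  }

  public static void main(String[] args) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    // From here on, the reader consults the resolver for external entities.
    reader.setEntityResolver(new LocalUsageTermsResolver());
  }
}
```

Returning null for everything you don't recognize keeps the resolver narrow: it only overrides the cases you explicitly care about.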
To insert your own logic into the resolution process, create an implementation of this interface, and register it with your XMLReader instance through setEntityResolver(  ). Once that’s done, every time the reader comes across an entity reference, it passes the public ID and system ID for that entity to the
Additional content appearing in this section has been removed.
Notations and Unparsed Entities
After a rather extensive look at EntityResolver, I’m going to cruise through DTDHandler (also in org.xml.sax). In almost nine years of extensive SAX and XML programming, I’ve used this interface only once—in writing JDOM (covered in Chapter 9)—and even then, it was a rather obscure case. Still, if you work with unparsed entities often, are into parser internals, or just want to get into every nook and cranny of the SAX API, then you need to know about DTDHandler. The interface is shown in all its simplicity in Figure 4-4.
Figure 4-4: This handler is concerned with the declaration of certain XML types, rather than the actual content of those entities (if and when they are resolved)
The DTDHandler interface allows you to receive notification when a reader encounters an unparsed entity or notation declaration. Of course, both of these events occur in DTDs, not XML documents, which is why this is called DTDHandler. The two methods listed in Figure 4-4 do exactly what you would expect. The first reports a notation declaration, including its name, public ID, and system ID. Remember the NOTATION structure in DTDs? (Flip back to Chapter 2 if you’re unclear.)
<!NOTATION jpeg SYSTEM "images/jpeg">
The second method provides information about an unparsed entity declaration, which looks as follows:
<!ENTITY stars_logo SYSTEM "http://www.nhl.com/img/team/dal38.gif"
                    NDATA jpeg>
In both cases, you can take action at these occurrences if you create an implementation of DTDHandler and register it with your reader through the XMLReader’s setDTDHandler(  ) method. This is generally useful when writing low-level applications that must reproduce XML content (such as an XML editor), or when you want to build up some Java representation of a DTD’s constraints (such as in a data binding implementation). In most other situations, it isn’t something you will need very often.
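As a small illustration, here is a sketch of a DTDHandler that simply records the names it is told about. The class name DeclarationLogger and its helper method are hypothetical, and the example assumes the JDK's bundled parser; it only records names, because the form in which system IDs are reported (relative or absolutized) can vary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical handler that records notation and unparsed entity declarations.
class DeclarationLogger implements DTDHandler {

  final List<String> seen = new ArrayList<>();

  @Override
  public void notationDecl(String name, String publicId, String systemId) {
    seen.add("notation " + name);
  }

  @Override
  public void unparsedEntityDecl(String name, String publicId,
                                 String systemId, String notationName) {
    seen.add("unparsed entity " + name + " (" + notationName + ")");
  }

  // Parses the given XML text and returns the declarations that were reported.
  static List<String> declarationsIn(String xml) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    DeclarationLogger logger = new DeclarationLogger();
    reader.setDTDHandler(logger);
    reader.parse(new InputSource(new StringReader(xml)));
    return logger.seen;
  }

  public static void main(String[] args) throws Exception {
    String xml =
        "<!DOCTYPE team [\n"
        + "  <!NOTATION jpeg SYSTEM \"images/jpeg\">\n"
        + "  <!ENTITY stars_logo SYSTEM"
        + " \"http://www.nhl.com/img/team/dal38.gif\" NDATA jpeg>\n"
        + "]>\n"
        + "<team/>";
    declarationsIn(xml).forEach(System.out::println);
  }
}
```

Because the entity is unparsed, the parser only reports the declaration; it makes no attempt to fetch the image.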
Additional content appearing in this section has been removed.
The DefaultHandler Class
Because SAX is interface-driven, you have to do a lot of tedious work to get started with an XML-based application. For example, when you write your ContentHandler implementation, you have to implement each and every method of that interface, even if you aren’t inserting behavior into each callback. If you need an ErrorHandler, you add three more method implementations; using DTDHandler? That’s a few more. A lot of times, though, you’re writing lots of no-operation methods, as you only need to interact with a couple of key callbacks.
Fortunately, org.xml.sax.helpers.DefaultHandler can be a real boon in these situations. This class doesn’t define any behavior of its own; however, it does implement ContentHandler, ErrorHandler, EntityResolver, and DTDHandler, and provides empty implementations of each method of each interface. So you can have a single class (call it, for example, MyHandlerClass) that extends DefaultHandler. You then only override the callback methods you’re concerned with. You might implement startElement(  ), characters(  ), endElement(  ), and fatalError(  ), for example. In any combination of implemented methods, though, you’ll save tons of lines of code for methods you don’t need to provide action for, and make your code a lot clearer too. Then, the argument to setErrorHandler(  ), setContentHandler(  ), and setDTDHandler(  ) would be the same instance of this MyHandlerClass.
You can pass a DefaultHandler instance to setEntityResolver(  ) as well, although (as I’ve already said) I discourage mixing EntityResolver implementations in with these other handlers.
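Here is a minimal sketch of that idea, assuming the JDK's bundled parser. The class name ElementCounter is hypothetical; it overrides only startElement(  ), characters(  ), and fatalError(  ), and the same instance is registered as the content, error, and DTD handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: overrides three callbacks and inherits the rest.
class ElementCounter extends DefaultHandler {

  int elements;
  final StringBuilder text = new StringBuilder();

  @Override
  public void startElement(String uri, String localName, String qName,
                           Attributes atts) {
    elements++;
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  @Override
  public void fatalError(SAXParseException e) throws SAXException {
    // Add the location to the message before giving up on the parse.
    throw new SAXException("Fatal error at line " + e.getLineNumber(), e);
  }

  static ElementCounter count(String xml) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    ElementCounter handler = new ElementCounter();
    // One object plays all three roles.
    reader.setContentHandler(handler);
    reader.setErrorHandler(handler);
    reader.setDTDHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    ElementCounter result = count("<a><b>hi</b><c/></a>");
    System.out.println(result.elements + " elements, text: " + result.text);
  }
}
```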
Additional content appearing in this section has been removed.
Extension Interfaces
SAX provides several extension interfaces. These are interfaces that SAX parsers are not required to support; you’ll find these interfaces in org.xml.sax.ext. In some cases, you’ll have to download these directly from the SAX web site (http://www.saxproject.org), although most parsers will include these in the parser download.
Because parsers aren’t required to support these handlers, never write code that absolutely depends on them, unless you’re sure you won’t be changing parsers. If you can provide enhanced features, but fall back to standard SAX, you’re in a much better position.
The first of these handlers is probably the most useful: org.xml.sax.ext.LexicalHandler. This handler provides methods that can receive notification of several lexical events in an XML document, such as comments, entity declarations, DTD declarations, and CDATA sections. In ContentHandler, these lexical events are essentially ignored, and you just get the data and declarations without notification of when or how they were provided.
This is not really a general-use handler, as most applications don’t need to know if text was in a CDATA section or not. However, if you are working with an XML editor, serializer, or other component that must know the exact format of the input document—and not just its contents—then the LexicalHandler can really help you out.
To see this guy in action, you first need to add an import statement for org.xml.sax.ext.LexicalHandler to your SAXTreeViewer.java source file. Once that’s done, you can add LexicalHandler to the implements clause in the nonpublic class JTreeHandler in that source file:
class JTreeHandler implements ContentHandler, ErrorHandler, LexicalHandler {
To get started, look at the first lexical event that might happen in processing an XML document: the start and end of a DTD reference or declaration. That triggers the startDTD(  ) and endDTD(  ) callbacks (I’ve coded up versions appropriate for SAXTreeViewer here):
public void startDTD(String name, String publicID,
                     String systemID)
  throws SAXException {

  DefaultMutableTreeNode dtdReference =
    new DefaultMutableTreeNode("DTD for '" + name + "'");
  if (publicID != null) {
    DefaultMutableTreeNode publicIDNode =
      new DefaultMutableTreeNode("Public ID: '" + publicID + "'");
    dtdReference.add(publicIDNode);
  }
  if (systemID != null) {
    DefaultMutableTreeNode systemIDNode =
      new DefaultMutableTreeNode("System ID: '" + systemID + "'");
    dtdReference.add(systemIDNode);
  }
  current.add(dtdReference);
}

public void endDTD( ) throws SAXException {
  // No action needed here
}
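If you don't have SAXTreeViewer handy, the same callbacks can be tried out in isolation. The sketch below is my own standalone variant, not part of SAXTreeViewer: it extends org.xml.sax.ext.DefaultHandler2 (which supplies empty LexicalHandler methods) and registers itself through the standard lexical-handler property. The class name LexicalEventRecorder is hypothetical, and the example assumes the JDK's bundled parser, which supports this extension.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical recorder for a few lexical events.
class LexicalEventRecorder extends DefaultHandler2 {

  static final String LEXICAL_HANDLER =
      "http://xml.org/sax/properties/lexical-handler";

  final List<String> events = new ArrayList<>();

  @Override
  public void startDTD(String name, String publicId, String systemId) {
    events.add("startDTD " + name);
  }

  @Override
  public void endDTD() {
    events.add("endDTD");
  }

  @Override
  public void comment(char[] ch, int start, int length) {
    events.add("comment " + new String(ch, start, length).trim());
  }

  @Override
  public void startCDATA() {
    events.add("startCDATA");
  }

  @Override
  public void endCDATA() {
    events.add("endCDATA");
  }

  static List<String> record(String xml) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    LexicalEventRecorder recorder = new LexicalEventRecorder();
    reader.setContentHandler(recorder);
    // Lexical handlers are registered as a property, not through a setter.
    reader.setProperty(LEXICAL_HANDLER, recorder);
    reader.parse(new InputSource(new StringReader(xml)));
    return recorder.events;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>"
        + "<note><!-- hi --><![CDATA[a < b]]></note>";
    record(xml).forEach(System.out::println);
  }
}
```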
Additional content appearing in this section has been removed.
Filters and Writers
At this point, I want to diverge from the beaten path. There are a lot of additional features in SAX that can really turn you into a power developer, and take you beyond the confines of “standard” SAX. In this section, I’ll introduce you to two of these: SAX filters and writers. Using classes both in the standard SAX distribution and available separately from the SAX web site (http://www.saxproject.org), you can add some fairly advanced behavior to your SAX applications. This will also get you in the mindset of using SAX as a pipeline of events, rather than a single layer of processing.
First on the list is the org.xml.sax.XMLFilter interface that comes in the basic SAX download, and should be included with any parser distribution supporting SAX 2. This interface extends the XMLReader interface, and adds two new methods to it, as shown in Figure 4-8.
Figure 4-8: Extra methods defined by the XMLFilter interface
It might not seem like there is much to say here; what’s the big deal, right? Well, by allowing a hierarchy of XMLReaders through this filtering mechanism, you can build up a processing chain, or pipeline, of events. To understand what I mean by a pipeline, you first need to understand the normal flow of a SAX parse:
  1. Events in an XML document are passed to the SAX reader.
  2. The SAX reader and registered handlers pass events and data to an application.
What developers started realizing, though, is that it is simple to insert one or more additional links into this chain:
  1. Events in an XML document are passed to the SAX reader.
  2. The SAX reader performs some processing and passes information to another SAX reader.
  3. Repeat until all SAX processing is done.
  4. Finally, the SAX reader and registered handlers pass events and data to an application.
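The longer chain in the list above can be sketched with org.xml.sax.helpers.XMLFilterImpl, a helper class in the standard SAX distribution that passes every event through unchanged unless you override a callback. The two filters here (one lowercases element names, one drops whitespace-only text) and the class names are my own illustrations rather than anything from the SAX distribution, and the example assumes the JDK's bundled parser.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// First link: rewrites element names to lowercase, then passes events on.
class LowerCaseNamesFilter extends XMLFilterImpl {
  LowerCaseNamesFilter(XMLReader parent) { super(parent); }

  @Override
  public void startElement(String uri, String localName, String qName,
                           Attributes atts) throws SAXException {
    super.startElement(uri, localName.toLowerCase(Locale.ROOT),
        qName.toLowerCase(Locale.ROOT), atts);
  }

  @Override
  public void endElement(String uri, String localName, String qName)
      throws SAXException {
    super.endElement(uri, localName.toLowerCase(Locale.ROOT),
        qName.toLowerCase(Locale.ROOT));
  }
}

// Second link: swallows character events that contain only whitespace.
class DropBlankTextFilter extends XMLFilterImpl {
  DropBlankTextFilter(XMLReader parent) { super(parent); }

  @Override
  public void characters(char[] ch, int start, int length)
      throws SAXException {
    if (!new String(ch, start, length).trim().isEmpty()) {
      super.characters(ch, start, length);
    }
  }
}

class FilterPipelineDemo {
  static String run(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader parser = factory.newSAXParser().getXMLReader();

    // parser -> LowerCaseNamesFilter -> DropBlankTextFilter -> application
    XMLReader pipeline =
        new DropBlankTextFilter(new LowerCaseNamesFilter(parser));

    StringBuilder out = new StringBuilder();
    pipeline.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName,
                               Attributes atts) {
        out.append('<').append(qName).append('>');
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
      }
    });
    pipeline.parse(new InputSource(new StringReader(xml)));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run("<Root>\n  <Item>x</Item>\n</Root>"));
  }
}
```

The application's handler never knows how many links sit in front of it; it just sees events that have already been cleaned up.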
It’s the middle two steps that create a pipeline, where one reader that performed specific processing passes its information on to another reader, repeatedly, instead of having to lump all code into one reader. When this pipeline is set up with multiple readers, modular and efficient programming results. And that’s what the
Additional content appearing in this section has been removed.
Chapter 5: DOM
SAX is just one of several APIs that allow XML work to be done within Java. This chapter and the next will widen your API knowledge as I introduce the Document Object Model, commonly called the DOM. This API is quite a bit different from SAX, and complements the Simple API for XML in many ways. You’ll need both, as well as the other APIs and tools in the rest of this book, to be a competent XML developer.
Because DOM is fundamentally different from SAX, I’ll spend a good bit of time discussing the concepts behind DOM, and why it might be used instead of SAX for certain applications. Selecting any XML API involves tradeoffs, and choosing between DOM and SAX is certainly no exception. Then I’ll move on to possibly the most important topic: code. I’ll introduce you to a utility class that serializes DOM trees and will provide a pretty good look at the DOM structure and related classes. This will get you ready for some more advanced DOM work.
Additional content appearing in this section has been removed.
The Document Object Model
The DOM, unlike SAX, has its origins in the World Wide Web Consortium (W3C; online at http://www.w3.org). Whereas SAX is public domain software, developed through long discussions on the XML-dev mailing list, DOM is a standard—just like the actual XML specification. The DOM is designed to represent the content and model of XML documents across all programming languages and tools. On top of that specification, there are several language bindings. These bindings exist for JavaScript, Java, CORBA, and other languages, allowing the DOM to be a cross-platform and cross-language specification.
In addition to being different from SAX in regard to standardization and language bindings, the DOM is organized into “levels” instead of versions. DOM Level 1 is an accepted recommendation, and you can view the completed specification at http://www.w3.org/TR/REC-DOM-Level-1. Level 1 details the functionality and navigation of content within a document.
A document in the DOM is not just limited to XML, but can be HTML or other content models as well.
DOM Level 2, which was finalized in November of 2000, adds core functionality to DOM Level 1. There are also several additional DOM modules and options aimed at specific content models, such as XML, HTML, and CSS. These less-generic modules begin to “fill in the blanks” left by the more general tools provided in DOM Level 1. You can view the current DOM Level 2 Recommendation at http://www.w3.org/TR/DOM-Level-2-Core. This is actually the recommendation for the DOM Core; all the supplemental modules are represented by their own specifications:
DOM Level 2 Views (http://www.w3.org/TR/DOM-Level-2-Views)
The Views module deals with interaction between an XML document and some type of stylesheet or presentation aspect. For instance, the same XML document could be styled by multiple CSS or XSL stylesheets; each of the resulting documents would be a view. It turns out that this module isn’t that useful, as Java tools for document transformation are plentiful; most parsers won’t support this module.
Additional content appearing in this section has been removed.
Serialization
Typically, I’d come up with some clever example for using DOM at this point, and use it to demonstrate how the API works. However, DOM leaves a rather gaping hole, and filling that hole proves to be a good DOM tutorial, as well as having practical value. This hole, of course, is serialization. Serialization is the process of taking an XML document in memory, represented as a DOM tree, and writing it to disk (or to a stream).
If you’re lucky enough to have a parser that implements the DOM Level 3 Load and Save module, then outputting a DOM tree isn’t a problem for you. Most parsers don’t provide that support (or slap “experimental” all over it), and that becomes a real problem for DOM programming.
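For the lucky case, here is a minimal sketch of what Load and Save serialization can look like. It assumes a DOM implementation that exposes the "LS" feature (the one bundled with current JDKs does); the class name LoadSaveSketch and the sample document are made up for illustration, and the code throws rather than guessing when the feature isn't available.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper built on the DOM Level 3 Load and Save interfaces.
class LoadSaveSketch {

  static String serialize(Document doc) {
    // Ask the implementation whether it supports Load and Save at all.
    Object feature = doc.getImplementation().getFeature("LS", "3.0");
    if (!(feature instanceof DOMImplementationLS)) {
      throw new UnsupportedOperationException(
          "This DOM implementation does not support Load and Save");
    }
    LSSerializer serializer =
        ((DOMImplementationLS) feature).createLSSerializer();
    // Leave out the XML declaration to keep the output short.
    serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
    return serializer.writeToString(doc);
  }

  // Builds a tiny in-memory document to serialize.
  static Document sample() throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    Element root = doc.createElement("greeting");
    root.setTextContent("hello");
    doc.appendChild(root);
    return doc;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(serialize(sample()));
  }
}
```

When the feature check fails, you're back to writing the tree out yourself.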
Before you can serialize a DOM tree representing some XML, though, you need to read that XML in the first place. Since you’ll usually be reading XML from a file, I’ll show you how to do just that. Example 5-1 is a sample class that takes an XML filename, and loads the document into a DOM tree, represented by the org.w3c.dom.Document interface.
Example 5-1. This test class reads in an XML document and loads it into a DOM tree
package javaxml3;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;

// Parser import
import org.apache.xerces.parsers.DOMParser;

public class SerializeTester {

  // File to read XML from
  private File inputXML;

  // File to serialize XML to
  private File outputXML;

  public SerializeTester