Java and XML, Second Edition: Solutions to Real-World Problems
By Brett McLaughlin
August 2001
Pages: 528


Table of Contents

Chapter 1: Introduction
Introductory chapters are typically pretty easy to write. In most books, you give an overview of the technology covered, explain a few basics, and try to get the reader interested. However, for this second edition of Java and XML, things aren't so easy. When the first edition came out, many people were still new to XML, or were skeptics wanting to see if this new type of markup was really as good as the hype. Over a year later, everyone is using XML in hundreds of ways. In a sense, you probably don't need an introduction. But I'll give you an idea of what's going to be covered, why it matters, and what you'll need to get up and running.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
XML Matters
First, let me simply say that XML matters. I know that sounds like the beginning of a self-help seminar, but it's worth starting with. There are still many developers, managers, and executives who are afraid of XML. They are afraid of the perception that XML is "cutting-edge," and of XML's high rate of change. (This is a second edition, a year later, right? Has that much changed?) They are afraid of the cost of hiring folks like you and me to work in XML. Most of all, they are afraid of adding yet another piece to their application puzzles.
To try and assuage these fears, let me quickly run down the major reasons that you should start working with XML, today. First, XML is portable. Second, it allows an unprecedented degree of interoperability. And finally, XML matters. . . because it doesn't matter! If that's completely confusing, read on and all will soon make sense.
XML is portable. If you've been around Java long, or have ever wandered through Moscone Center at JavaOne, you've heard the mantra of Java: "portable code." Compile Java code, drop those .class or .jar files onto any operating system, and the code runs. All you need is a Java Runtime Environment (JRE) or Java Virtual Machine (JVM), and you're set. This has continually been one of Java's biggest draws, because developers can work on Linux or Windows workstations, develop and test code, and then deploy on Sparcs, E4000s, HP-UX, or anything else you could imagine.
As a result, XML is worth more than a passing look. Because XML is simply text, it can obviously be moved between various platforms. Even more importantly, XML must conform to a specification defined by the World Wide Web Consortium (W3C) at http://www.w3.org. This means that XML is a standard. When you send XML, it conforms to this standard; when some other application receives it, the XML still conforms to that standard. The receiving application can count on that. This is essentially what Java provides: any JVM knows what to expect, and as long as code conforms to those expectations, it will run. By using XML, you get portable data. In fact, recently you may have heard the phrase "portable code, portable data" in reference to the combination of Java and XML. It's a good saying, because it turns out (as not all marketing-type slogans do) to be true.
What's Important?
Once you've accepted that XML can help you out, the next question is what part of it you need. As I mentioned earlier, there are literally hundreds of applications of XML, and finding the right one is not an easy task. I have to pick twelve or thirteen key topics from these hundreds and make them all applicable to you, which is no small job either. Fortunately, I've had a year to gather feedback from the first edition of this book, and have been working with XML in production applications for well over two years now. That means that I've at least got an idea of what's interesting and useful. When you boil all the various XML machinery down, you end up with just a few categories.
An API is an application programming interface, and a low-level API is one that lets you deal directly with an XML document's content. In other words, there is little to no preprocessing, and you get raw XML content to work with. It is the most efficient way to deal with XML, and also the most powerful. At the same time, it requires the most knowledge about XML, and generally involves the most work to turn document content into something useful.
The two most common low-level APIs today are SAX, the Simple API for XML, and DOM, the Document Object Model. Additionally, JDOM (which is not an acronym, nor is it an extension of DOM) has gained a lot of momentum lately. All three of these are in some form of standardization (SAX as a de facto standard, DOM by the W3C, and JDOM through Sun's Java Community Process), and are good bets to be long-lasting technologies. All three offer you access to an XML document, in differing forms, and let you do pretty much anything you want with the document. I'll spend quite a bit of time on these APIs, as they are the basis for everything else you'll do in XML. I've also devoted a chapter to JAXP, Sun's Java API for XML Processing, which provides a thin abstraction layer over SAX and DOM.
High-level APIs are the next step up the ladder. Instead of offering direct access to a document, they rely on low-level APIs to do that work for them. Additionally, these APIs present the document in a different form, either more user-friendly, or modeled in a certain way, or in some form other than a basic XML document structure. While these APIs are often easier to use and quicker to develop with, you may pay an additional processing cost while your data is converted to a different format. Also, you'll need to spend some time learning the API, most likely in addition to some lower-level APIs.
The Essentials
Now you're ready to learn how to use Java and XML to their best. What do you need? I will address that subject, give you some basics, and then let you get after it.
You need an operating system and a Java installation. I say this almost tongue in cheek; if you expect to get through this book with no OS (operating system) and no Java installation, you just might be in a bit over your head. Still, it's worth letting you know what I expect. I wrote the first half of this book and the examples for those chapters on a Windows 2000 machine, running both JDK 1.2 and JDK 1.3 (as well as 1.3.1). I did most of my compiling under Cygwin (from Cygnus), so I usually operate in a Unix-esque environment. The last half of the book was written on my (at the time) brand new Macintosh G4 running OS X. That system comes with JDK 1.3, and is a beauty, for those of you who are curious.
In any case, all the examples should work unchanged with Java 1.2 or above; I used no features of JDK 1.3. However, I did not write this code to compile under Java 1.1, as I felt using the Java 2 Collections classes was important. Additionally, if you're working with XML, you need to take a long hard look at updating your JDK if you're still on 1.1 (I know some of you have no choice). If you are stuck on a 1.1 JVM, you should be able to get the collections from Sun (http://java.sun.com), make some small modifications, and be up and running.
You will need an XML parser. One of the most important layers to any XML-aware application is the XML parser. This component handles the important task of taking a raw XML document as input and making sense of the document; it will ensure that the document is well-formed, and if a DTD or schema is referenced, it may be able to ensure that the document is valid. What results from an XML document being parsed is typically a data structure that can be manipulated and handled by other XML tools or Java APIs. I'm going to leave the detailed discussions of these APIs for later chapters. For now, just be aware that the parser is one of the core building blocks to using XML data.
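To make the parser's first job concrete, here is a minimal sketch of a well-formedness check. It assumes a current JDK, which bundles a JAXP parser, rather than the separate parser download this book discusses; the class name WellFormedCheck and its isWellFormed method are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the book's sample code.
class WellFormedCheck {

    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // parse() builds an in-memory tree, or throws on the first
            // well-formedness violation it finds.
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Setup or I/O trouble, unrelated to the document's content.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Matching start and end tags.
        System.out.println(isWellFormed("<book><title>Java and XML</title></book>"));
        // The title element is never closed.
        System.out.println(isWellFormed("<book><title>Java and XML</book>"));
    }
}
```

The data structure the parser builds is discarded here; the sketch only reports whether parsing succeeded. The JDK's default error handler may also print a message to standard error for the rejected document.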
What's Next?
Now you're probably ready to get on with it. In the next chapter, I'm going to give you a crash course in XML. If you're new to XML, or are shaky on the basics, this chapter will fill in the gaps. If you're an old hand to XML, I'd recommend you skim the chapter, and move on to the code in Chapter 3. In either case, get ready to dive into Java and XML; things get exciting from here on in.
Chapter 2: Nuts and Bolts
With the introductions behind us, let's get to it. Before heading straight into Java, though, some basic structures must be laid down. These address a fundamental understanding of the concepts in XML and how the extensible markup language works. In other words, you need an XML primer. If you are already an XML expert, skim through this chapter to make sure you're comfortable with the topics addressed. If you're completely new to XML, on the other hand, this chapter can get you ready for the rest of the book without hours, days, or weeks of study.
You can use this chapter as a glossary while you read the rest of the book. I won't spend time in future chapters explaining XML concepts, in order to deal strictly with Java and get to some more advanced concepts. So if you hit something that completely befuddles you, check this chapter for information. And if you are still a little lost, I highly recommend that you read this book with a copy of Elliotte Harold and Scott Means' excellent book XML in a Nutshell (O'Reilly) open. That will give you all the information you need on XML concepts, and then I can focus on Java ones.
The Basics
It all begins with the XML 1.0 Recommendation, which you can read in its entirety at http://www.w3.org/TR/REC-xml. Example 2-1 shows a simple XML document that conforms to this specification. It's a portion of the XML table of contents for this book (I've only included part of it because it's long!). The complete file is included with the samples for the book, available online at http://www.oreilly.com/catalog/javaxml2 and http://www.newInstance.com. I'll use it to illustrate several important concepts.
Example 2-1. The contents.xml document
<?xml version="1.0"?>
<!DOCTYPE book SYSTEM "DTD/JavaXML.dtd">

<!-- Java and XML Contents -->
<book xmlns="http://www.oreilly.com/javaxml2"
      xmlns:ora="http://www.oreilly.com"
>
  <title ora:series="Java">Java and XML</title>

  <!-- Chapter List -->
  <contents>
    <chapter title="Introduction" label="1">
      <topic name="XML Matters" />
      <topic name="What's Important" />
      <topic name="The Essentials" />
      <topic name="What&apos;s Next?" />
    </chapter>
    <chapter title="Nuts and Bolts" label="2">
      <topic name="The Basics" />
      <topic name="Constraints" />
      <topic name="Transformations" />
      <topic name="And More..." />
      <topic name="What&apos;s Next?" />
    </chapter>
    <chapter title="SAX" label="3">
      <topic name="Getting Prepared" />
      <topic name="SAX Readers" />
      <topic name="Content Handlers" />
      <topic name="Gotcha!" />
      <topic name="What&apos;s Next?" />
    </chapter> 
    <chapter title="Advanced SAX" label="4">
      <topic name="Properties and Features" />
      <topic name="More Handlers" />
      <topic name="Filters and Writers" />
      <topic name="Even More Handlers" />
      <topic name="Gotcha!" />
      <topic name="What&apos;s Next?" />
    </chapter>
    <chapter title="DOM" label="5">
      <topic name="The Document Object Model" />
      <topic name="Serialization" />
      <topic name="Mutability" />
      <topic name="Gotcha!" />
      <topic name="What&apos;s Next?" />
    </chapter>           

    <!-- And so on... -->

  </contents>

  <ora:copyright>&OReillyCopyright;</ora:copyright>
</book>
Constraints
Next up to bat is dealing with constraining XML. If there's nothing you get out of this chapter other than the rationale behind constraining XML, then I'm a happy author. Because XML is extensible and can represent data in hundreds and even thousands of ways, constraints on a document provide meaning to those various formats. Without document constraints, it is impossible (in most cases) to tell what the data in a document means. In this section, I'm going to cover the two current standard means of constraining XML: DTDs (included in the XML 1.0 specification) and XML Schema (recently a standard put out by the W3C). Choose the one best suited for you.
An XML document is not very usable without an accompanying DTD (or schema). Just as XML can effectively describe data, the DTD makes this data usable for many different programs in a variety of ways by defining the structure of the data. In this section, I show you the most common constructs used within a DTD. I use the XML representation of a portion of the table of contents for this book as an example again, and go through the process of constructing a DTD for the XML table of contents document.
The DTD defines how data is formatted. It must define each allowed element in an XML document, the allowed attributes and possibly the acceptable attribute values for each element, the nesting and occurrences of each element, and any external entities. DTDs can specify many other things about an XML document, but these basics are what we will focus on. You will learn the constructs that a DTD offers by applying them to and constraining the XML file from Example 2-1. The complete DTD is shown in Example 2-3, which I'll refer to in this section.
Example 2-3. DTD for Example 2-1
<!ELEMENT book (title, contents, ora:copyright)>
<!ATTLIST book
          xmlns       CDATA  #REQUIRED
          xmlns:ora   CDATA  #REQUIRED
>
<!ELEMENT title (#PCDATA)>
<!ATTLIST title
          ora:series  (C | Java | Linux | Oracle | 
                      Perl | Web | Windows)   
                      #REQUIRED
>
<!ELEMENT contents (chapter+)>
<!ELEMENT chapter (topic+)>
<!ATTLIST chapter
          title       CDATA  #REQUIRED
          label       CDATA  #REQUIRED
>
<!ELEMENT topic EMPTY>
<!ATTLIST topic
          name        CDATA  #REQUIRED
>

<!-- Copyright Information -->
<!ELEMENT ora:copyright (copyright)>
<!ELEMENT copyright (year, content)>
<!ATTLIST copyright
          xmlns  CDATA  #REQUIRED
>
<!ELEMENT year EMPTY>
<!ATTLIST year
          value  CDATA  #REQUIRED
>
<!ELEMENT content (#PCDATA)>
<!ENTITY OReillyCopyright SYSTEM
   "http://www.newInstance.com/javaxml2/copyright.xml"
>
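The kind of constraint Example 2-3 expresses can be checked from Java by turning on validation in the parser. The following is a sketch only: it assumes a current JDK with its bundled JAXP parser, and it uses a cut-down, hypothetical DTD (just the contents, chapter, and topic declarations, supplied as an internal subset) so that it runs without network access. The class name DtdValidationSketch is also made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the book's sample code.
class DtdValidationSketch {

    // Cut-down DTD in the style of Example 2-3 (an assumption made for brevity).
    static final String DOCTYPE =
          "<!DOCTYPE contents [\n"
        + "  <!ELEMENT contents (chapter+)>\n"
        + "  <!ELEMENT chapter (topic+)>\n"
        + "  <!ATTLIST chapter title CDATA #REQUIRED>\n"
        + "  <!ELEMENT topic EMPTY>\n"
        + "  <!ATTLIST topic name CDATA #REQUIRED>\n"
        + "]>\n";

    /** Parses DOCTYPE + body with validation on; returns the problems reported. */
    static List<String> validate(String body) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask for DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity errors are recoverable, so collect and continue.
                    problems.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (SAXException e) {
            // Fatal errors (a document that is not well-formed) end up here.
            problems.add(e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Satisfies the DTD: one chapter holding one topic.
        System.out.println(validate(
            "<contents><chapter title=\"SAX\"><topic name=\"SAX Readers\"/></chapter></contents>"));
        // Violates (topic+): the chapter has no topics.
        System.out.println(validate("<contents><chapter title=\"SAX\"/></contents>"));
    }
}
```

Collecting messages in a list, rather than stopping at the first one, reflects the fact that validity errors are recoverable: the parser can keep going and report every violation it finds.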
Transformations
As useful as XML transformations can be, they are not simple to implement. In fact, rather than trying to specify the transformation of XML in the original XML 1.0 specification, three separate recommendations have come out to define how transformations should occur. Although one of these (XPath) is also used in several other XML specifications, by far the most common use of the components I outline here is to transform XML from one format into another.
Because these three specifications are tied together tightly and almost always used in concert, there is rarely a clear distinction between them. This can make for a discussion that is easy to understand, but not necessarily technically correct. For example, the term XSLT, which refers specifically to XSL Transformations, is often applied to both the Extensible Stylesheet Language (XSL) and XPath. In the same fashion, XSL is often used as a grouping term for all three technologies. In this section, I distinguish among the three recommendations, and remain true to the letter of the specifications outlining these technologies. However, in the interest of clarity, I use XSL and XSLT interchangeably to refer to the complete transformation process throughout the rest of the book. Although this may not follow the letter of these specifications, it certainly follows their spirit, and it avoids lengthy definitions of simple concepts when you already understand what I mean.
XSL is the Extensible Stylesheet Language. It is defined as a language for expressing stylesheets. This broad definition is broken down into two parts:
  • XSL is a language for transforming XML documents.
  • XSL is an XML vocabulary for specifying the formatting of XML documents.
The definitions are similar, but one deals with moving from one XML document form to another, while the other focuses on the actual presentation of content within each document. Perhaps a clearer definition would be to say that XSL handles the specification of how to transform a document from format A to format B. The components of the language handle the processing and identification of the constructs used to do this.
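As a rough illustration of moving a document from format A to format B, the sketch below applies a tiny stylesheet through the standard javax.xml.transform API. It assumes a current JDK with its bundled XSLT 1.0 processor; the stylesheet, the input document, and the class name TransformSketch are all invented for this example and are not taken from the book's samples.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of the book's sample code.
class TransformSketch {

    // The stylesheet (XSLT) uses XPath expressions such as "chapter" and
    // "@title" to pick out the parts of the input it cares about.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/contents'>"
        + "<xsl:for-each select='chapter'>"
        + "<xsl:value-of select='@title'/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Turns a contents document into a semicolon-terminated list of titles. */
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<contents><chapter title='SAX'/><chapter title='DOM'/></contents>";
        System.out.println(transform(xml)); // prints SAX;DOM;
    }
}
```

Here format A is XML and format B is plain text; the same pattern applies when the stylesheet produces HTML or another XML vocabulary instead.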
And More...
Lest I mislead you into thinking that's all that there is to XML, I want to make sure that you realize there are a multitude of other XML-related technologies. I can't possibly get into them all here. You should take a quick glance at things like CSS (Cascading Style Sheets) and XHTML if you are working on web design. Document authors will want to find out more about XLink and XPointer (both of which I cover in Chapter 16). XQL (XML Query Language) will be of interest to database programmers. In other words, there's something XML for pretty much every technology space right now. Take a look at the W3C XML activity page at http://www.w3.org/XML and see what looks interesting.
What's Next?
With some baseline knowledge of XML, you're ready to dive into the Java side of things. In the next chapter, I'll introduce you to SAX, the Simple API for XML. This is ground zero of the Java and XML APIs, and will get you started on seeing how you can use XML in your own Java applications. You'll learn how to read documents, set various options for DTD and schema validation, use namespace processing, and more, and understand when SAX is the right tool for a particular job. Fire up your editor and turn the page.
Chapter 3: SAX
When dealing with XML programmatically, one of the first things you have to do is take an XML document and parse it. As the document is parsed, the data in the document becomes available to the application using the parser, and suddenly you are within an XML-aware application! If this sounds a little too simple to be true, it almost is. This chapter describes how an XML document is parsed, focusing on the events that occur within this process. These events are important, as they are all points where application-specific code can be inserted and data manipulation can occur.
As a vehicle for this chapter, I'm going to introduce the Simple API for XML (SAX). SAX is what makes insertion of this application-specific code into events possible. The interfaces provided in the SAX package will become an important part of any programmer's toolkit for handling XML. Even though the SAX classes are small and few in number, they provide a critical framework for Java and XML to operate within. Solid understanding of how they help in accessing XML data is critical to effectively leveraging XML in your Java programs. In later chapters, we'll add to this toolkit other Java and XML APIs like DOM, JDOM, JAXP, and data binding. But, enough fluff; it's time to talk SAX.
Getting Prepared
There are a few items that you must have before beginning to code. They are:
  • An XML parser
  • The SAX classes
  • An XML document
First, you must obtain an XML parser. Writing a parser for XML is a serious task, and there are several efforts going on to provide excellent XML parsers, especially in the open source arena. I am not going to detail the process of actually writing an XML parser here; rather, I will discuss the applications that wrap this parsing behavior, focusing on using existing tools to manipulate XML data. This results in better and faster programs, as neither you nor I spend time trying to reinvent what is already available. After selecting a parser, you must ensure that a copy of the SAX classes is on hand. These are easy to locate, and are key to Java code's ability to process XML. Finally, you need an XML document to parse. Then, on to the code!
The first step to coding Java that uses XML is locating and obtaining the parser you want to use. I briefly talked about this process in Chapter 1, and listed various XML parsers that could be used. To ensure that your parser works with all the examples in the book, you should verify your parser's compliance with the XML specification. Because of the variety of parsers available and the rapid pace of change within the XML community, all of the details about which parsers have what compliance levels are beyond the scope of this book. Consult the parser's vendor and visit the web sites previously given for this information.
In the spirit of the open source community, all of the examples in this book use the Apache Xerces parser. Freely available in binary and source form at http://xml.apache.org, this C- and Java-based parser is already one of the most widely contributed-to parsers available (not that hardcore Java developers like us care about C, though, right?). In addition, using an open source parser such as Xerces allows you to send questions or bug reports to the parser's authors, resulting in a better product, as well as helping you use the software quickly and correctly. To subscribe to the general list and request help on the Xerces parser, send a blank email to
SAX Readers
Without spending any further time on the preliminaries, it's time to code. As a sample to familiarize you with SAX, this chapter details the SAXTreeViewer class. This class uses SAX to parse an XML document supplied on the command line, and displays the document visually as a Swing JTree. If you don't know anything about Swing, don't worry; I don't focus on that, but just use it for visual purposes. The focus will remain on SAX, and how events within parsing can be used to perform customized action. All that really happens is that a JTree is used, which provides a nice simple tree model, to display the XML input document. The key to this tree is the DefaultMutableTreeNode class, which you'll get quite used to in using this example, as well as the DefaultTreeModel that takes care of the layout.
The first thing you need to do in any SAX-based application is get an instance of a class that conforms to the SAX org.xml.sax.XMLReader interface. This interface defines parsing behavior and allows us to set features and properties (which I'll cover later in this chapter). For those of you familiar with SAX 1.0, this interface replaces the org.xml.sax.Parser interface.
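One way to get hold of an XMLReader without naming a parser class is to go through JAXP. This is a sketch under that assumption (a current JDK with its bundled parser), not necessarily the approach the chapter's SAXTreeViewer takes; the ReaderSketch class and its method names are hypothetical.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical demo class, not part of the book's sample code.
class ReaderSketch {

    // Standard SAX 2.0 feature URI for namespace processing.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    /** Obtains a namespace-aware XMLReader from the platform's default parser. */
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reports whether the reader has namespace processing switched on. */
    static boolean namespacesEnabled(XMLReader reader) {
        try {
            return reader.getFeature(NAMESPACES);
        } catch (SAXException e) {
            // The feature was not recognized or not supported.
            return false;
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println(reader.getClass().getName());
        System.out.println(namespacesEnabled(reader));
    }
}
```

The rest of the application can then program against the XMLReader interface alone, which is what makes it possible to swap parsers later.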
This is a good time to point out that SAX 1.0 is not covered in this book. While there is a very small section at the end of this chapter explaining how to convert SAX 1.0 code to SAX 2.0, you really are not in a good situation if you are using SAX 1.0. While the first edition of this book came out on the heels of SAX 2.0, it's now been well over a year since the API was released in a 2.0 final form. I strongly urge you to move on to Version 2 if you haven't already.
SAX provides an interface all SAX-compliant XML parsers should implement. This allows SAX to know exactly what methods are available for callback and use within an application. For example, the Xerces main SAX parser class,
Content Handlers
In order to let an application do something useful with XML data as it is being parsed, you must register handlers with the SAX parser. A handler is nothing more than a set of callbacks that SAX defines to let programmers insert application code at important events within a document's parsing. These events take place as the document is parsed, not after the parsing has occurred. This is one of the reasons that SAX is such a powerful interface: it allows a document to be handled sequentially, without having to first read the entire document into memory. Later, we will look at the Document Object Model (DOM), which does require the entire document to be read into memory first.
There are four core handler interfaces defined by SAX 2.0: org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, org.xml.sax.DTDHandler, and org.xml.sax.EntityResolver. In this chapter, I will discuss ContentHandler and ErrorHandler. I'll leave discussion of DTDHandler and EntityResolver for the next chapter; it is enough for now to understand that EntityResolver works just like the other handlers, and is built specifically for resolving external entities specified within an XML document. Custom application classes that perform specific actions within the parsing process can implement each of these interfaces. These implementation classes can be registered with the reader using the methods setContentHandler( ), setErrorHandler( ), setDTDHandler( ), and setEntityResolver( ). Then the reader invokes the callback methods on the appropriate handlers during parsing.
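The registration step can be sketched in a few lines. The example below is not the SAXTreeViewer code; it is a hypothetical HandlerSketch class that extends the SAX helper class DefaultHandler (which supplies empty implementations of all four handler interfaces) and simply records element names as the reader reports them. It assumes a current JDK with its bundled parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the book's sample code.
class HandlerSketch {

    /** Records element names in the order their start tags are reported. */
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // Called while parsing is still in progress, once per start tag.
            names.add(qName);
        }
    }

    /** Parses the text and returns the element names seen, in document order. */
    static List<String> elementNames(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            NameCollector collector = new NameCollector();
            reader.setContentHandler(collector); // receives content events
            reader.setErrorHandler(collector);   // receives warnings and errors
            reader.parse(new InputSource(new StringReader(xml)));
            return collector.names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(
            "<chapter title='SAX'><topic name='a'/><topic name='b'/></chapter>"));
    }
}
```

Because the callbacks fire as the document streams past, the collector never needs the whole document in memory; it keeps only what it chooses to record.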
For the SAXTreeViewer example, a good start is to implement the ContentHandler interface. This interface defines several important methods within the parsing lifecycle that our application can react to. Since all the necessary import statements are in place (I cheated and put them in already), all that is needed is to code an implementation of the ContentHandler interface. For simplicity, I'll do this as a nonpublic class, still within the
Error Handlers
In addition to providing the ContentHandler interface for handling parsing events, SAX provides an ErrorHandler interface that can be implemented to treat various error conditions that may arise during parsing. This interface works in the same manner as the content handler already constructed, but defines only three callback methods. Through these three methods, all possible error conditions are handled and reported by SAX parsers. Here's a look at the ErrorHandler interface:
public interface ErrorHandler {
    public abstract void warning (SAXParseException exception)
		throws SAXException;
    public abstract void error (SAXParseException exception)
		throws SAXException;
    public abstract void fatalError (SAXParseException exception)
		throws SAXException;
}
Each method receives information about the error or warning that has occurred through a SAXParseException. This object holds the line number where the trouble was encountered, the URI of the document being treated (which could be the parsed document or an external reference within that document), and normal exception details such as a message and a printable stack trace. In addition, each method can throw a SAXException. This may seem a bit odd at first; an exception handler that throws an exception? Keep in mind that each handler receives a parsing exception. This can be a warning that should not cause the parsing process to stop or an error that needs to be resolved for parsing to continue; however, the callback may need to perform system I/O or another operation that can throw an exception, and it needs to be able to send any problems resulting from these actions up the application chain. It can do this through the SAXException the error handler callback is allowed to throw.
As an example, consider an error handler that receives error notifications and writes those errors to an error log. This callback method needs to be able to either append to or create an error log on the local filesystem. If a warning were to occur within the process of parsing an XML document, the warning would be reported to this method. The intent of the warning is to give information to the callback and then continue parsing the document. However, if the error handler could not write to the log file, it might need to notify the parser and application that all parsing should stop. This can be done by catching any I/O exceptions and rethrowing these to the calling application, thus causing any further document parsing to stop. This common scenario is why error handlers must be able to throw exceptions (see Example 3-2).
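The logging scenario just described might look something like the following. This is an illustrative sketch, not the book's Example 3-2: the LoggingErrorHandler class, its log format, and the deliberately broken Writer used in the demonstration are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical error handler that appends each report to a log.
class LoggingErrorHandler implements ErrorHandler {

    private final Writer log;

    LoggingErrorHandler(Writer log) {
        this.log = log;
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        write("WARNING", e); // returning normally lets parsing continue
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        write("ERROR", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        write("FATAL", e);
        throw e; // a fatal error should stop parsing
    }

    private void write(String level, SAXParseException e) throws SAXException {
        try {
            log.write(level + " line " + e.getLineNumber() + ": " + e.getMessage() + "\n");
            log.flush();
        } catch (IOException io) {
            // The log itself failed: wrap the I/O problem so it travels up
            // to the application and halts parsing.
            throw new SAXException("Could not write to error log", io);
        }
    }

    /** Demonstrates that a failing log surfaces as a SAXException. */
    static boolean failingLogStopsParsing() {
        Writer broken = new Writer() {
            @Override
            public void write(char[] buf, int off, int len) throws IOException {
                throw new IOException("disk full");
            }
            @Override public void flush() { }
            @Override public void close() { }
        };
        try {
            new LoggingErrorHandler(broken).warning(new SAXParseException("demo", null));
            return false;
        } catch (SAXException e) {
            return e.getException() instanceof IOException;
        }
    }

    public static void main(String[] args) throws SAXException {
        StringWriter log = new StringWriter();
        new LoggingErrorHandler(log).warning(
            new SAXParseException("demo warning", null, "contents.xml", 3, 1));
        System.out.print(log);
        System.out.println(failingLogStopsParsing());
    }
}
```

A handler like this would be registered with setErrorHandler( ). Wrapping the IOException inside a SAXException keeps the original cause available to the calling application through getException( ).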
Additional content appearing in this section has been removed.
Gotcha!
Before leaving this introduction to parsing XML documents with SAX, there are a few pitfalls to make you aware of. These "gotchas" will help you avoid common programming mistakes when using SAX, and I will discuss more of these for other APIs in the appropriate sections.
For those of you who are forced to use a SAX 1.0 parser, perhaps in an existing application, don't despair. First, you always have the option of changing parsers; keeping current on SAX standards is an important part of an XML parser's responsibility, and if your vendor is not doing this, you may have other concerns to address with them as well. However, there are certainly cases where you are forced to use a parser because of legacy code or applications; in these situations, you are still not left out in the cold.
SAX 2.0 includes a helper class, org.xml.sax.helpers.ParserAdapter, which can actually cause a SAX 1.0 Parser implementation to behave like a SAX 2.0 XMLReader implementation. This handy class takes in a 1.0 Parser implementation as an argument and then can be used instead of that implementation. It allows a ContentHandler to be set (which is a SAX 2.0 construct), and handles all namespace callbacks properly (also a feature of SAX 2.0). The only functionality loss you will see is that skipped entities will not be reported, as this capability was not available in a 1.0 implementation in any form, and cannot be emulated by a 2.0 adapter class. Example 3-3 shows this behavior in action.
Example 3-3. Using SAX 1.0 with SAX 2.0 code constructs
try {
    // Register a parser with SAX
    Parser parser = 
        ParserFactory.makeParser(
            "org.apache.xerces.parsers.SAXParser");
            
    ParserAdapter myParser = new ParserAdapter(parser);
                                        
    // Register the content handler
    myParser.setContentHandler(contentHandler);
    
    // Register the error handler
    myParser.setErrorHandler(errHandler);            
        
    // Parse the document      
    myParser.parse(uri);
    
} catch (ClassNotFoundException e) {
    System.out.println(
        "The parser class could not be found.");
} catch (IllegalAccessException e) {
    System.out.println(
        "Insufficient privileges to load the parser class.");
} catch (InstantiationException e) {
    System.out.println(
        "The parser class could not be instantiated.");
} catch (ClassCastException e) {
    System.out.println(
        "The parser does not implement org.xml.sax.Parser");
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage( ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage( ));
}
Additional content appearing in this section has been removed.
What's Next?
Now that you have a taste of SAX, you are ready for some of the more advanced features of the API. These include setting properties and features, using validation and namespace processing, and the EntityResolver and DTDHandler interfaces. Additionally, you'll take a look at many less used (but still valuable) features of the Simple API for XML, as well as the optional add-ons to SAX, such as filters and the org.xml.sax.ext package. This should get those of you who are using SAX in applications up, running, and even flying past developers around you. That's always good. Keep that editor humming, and turn the page.
Additional content appearing in this section has been removed.
Chapter 4: Advanced SAX
The last chapter was a good introduction to SAX. However, there are several more topics that will round out your knowledge of SAX. While I've called this chapter "Advanced SAX," don't be intimidated. It could just as easily be called "Less-Used Portions of SAX that are Still Important." In writing these two chapters, I followed the 80/20 principle. 80% of you will probably never need to use the material in this chapter, and Chapter 3 will completely cover your needs. However, for those power users out there working in XML day in and day out, this chapter covers some of the finer points of SAX that you'll need.
I'll start with a look at setting parser properties and features, and discuss configuring your parser to do whatever you need it to. From there, I'll move on to some more handlers: the EntityResolver and DTDHandler left over from the last chapter. At that point, you should have a comprehensive understanding of the standard SAX 2.0 distribution. However, we'll push on to look at some SAX extensions, beginning with the writers that can be coupled with SAX, as well as some filtering mechanisms. Finally, I'll introduce some new handlers to you, the LexicalHandler and DeclHandler, and show you how they are used. When all is said and done (including another "Gotcha!" section), you should be ready to take on the world with just your parser and the SAX classes. So slip into your shiny spacesuit and grab the flightstick—ahem. Well, I got carried away with the taking on the world. In any case, let's get down to it.
Additional content appearing in this section has been removed.
Properties and Features
With the wealth of XML-related specifications and technologies emerging from the World Wide Web Consortium (W3C), adding support for any new feature or property of an XML parser has become difficult. Many parser implementations have added proprietary extensions or methods at the cost of code portability. While these software packages may implement the SAX XMLReader interface, the methods for setting document and schema validation, namespace support, and other core features are not standard across parser implementations. To address this, SAX 2.0 defines a standard mechanism for setting important properties and features of a parser that allows the addition of new properties and features as they are accepted by the W3C without the use of proprietary extensions or methods.
Lucky for you and me, SAX 2.0 includes the methods needed for setting properties and features in the XMLReader interface. This means you have to change little of your existing code to request validation, set the namespace separator, and handle other feature and property requests. The methods used for these purposes are outlined in Table 4-1.
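As a rough sketch of what this looks like in code, the following requests a feature through setFeature( ) and copes with a parser that rejects the request. The class and helper names are hypothetical, and I'm assuming the parser bundled with the JDK, obtained through JAXP, rather than any particular vendor class. The two feature URIs are the standard SAX 2.0 identifiers for namespace processing and validation.

```java
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureSketch {

    // Standard SAX 2.0 feature identifiers
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String VALIDATION =
        "http://xml.org/sax/features/validation";

    // Assumption: use whatever SAX parser the JDK supplies through JAXP
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance()
                               .newSAXParser()
                               .getXMLReader();
    }

    // Returns true if the parser accepted the requested setting
    static boolean trySetFeature(XMLReader reader, String featureID,
                                 boolean value) {
        try {
            reader.setFeature(featureID, value);
            return true;
        } catch (SAXNotRecognizedException e) {
            System.out.println("Feature not recognized: " + featureID);
        } catch (SAXNotSupportedException e) {
            System.out.println("Feature not supported: " + featureID);
        }
        return false;
    }
}
```

Calling FeatureSketch.trySetFeature(reader, FeatureSketch.NAMESPACES, true) and then reader.getFeature(FeatureSketch.NAMESPACES) shows whether the request took effect. The same pattern applies to properties, with setProperty( ) and getProperty( ) taking an Object value instead of a boolean.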
Table 4-1: Property and feature methods
Method            Returns    Parameters           Syntax
setProperty( )    void       String propertyID
Additional content appearing in this section has been removed.
More Handlers
In the last chapter, I showed you the ContentHandler and ErrorHandler interfaces and briefly mentioned the EntityResolver and DTDHandler interfaces as well. Now that you've got a good understanding of SAX basics, you're ready to look at these two other handlers. You'll find that you use EntityResolver every now and then (more if you're writing applications to be resold), and that the DTDHandler is something rarely ever pulled out of your bag of tricks.
The first of these new handlers is org.xml.sax.EntityResolver. This interface does exactly what it says: resolves entities (or at least declares a method that resolves entities, but you get the idea). The interface defines only a single method, and it looks like this:
public InputSource resolveEntity(String publicID, String systemID)
    throws SAXException, IOException;
You can create an implementation of this interface, and register it with your XMLReader instance (through setEntityResolver( ), not surprisingly). Once that's done, every time the reader comes across an entity reference, it passes the public ID and system ID for that entity to the resolveEntity( ) method of your implementation. Now you can change the normal process of entity resolution.
Typically, the XML reader resolves the entity through the specified public or system ID, whether it be a file, URL, or other resource. And if the return value from the resolveEntity( ) method is null, this process executes unchanged. As a result, you should always make sure that whatever code you add to your resolveEntity( ) implementation, it returns null in the default case. In other words, start with an implementation class that looks like Example 4-1.
Example 4-1. Simple implementation of EntityResolver
package javaxml2;

import java.io.IOException;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class SimpleEntityResolver implements EntityResolver {
    
    public InputSource resolveEntity(String publicID, String systemID)
        throws IOException, SAXException {
        
        // In the default case, return null
        return null;    
    }
}    
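To see the callback fire, here is a small sketch of registering a resolver and parsing a document that refers to an external DTD. Everything specific in it is an assumption for illustration: the class name, the example.invalid system ID, and the use of the JDK's bundled parser through JAXP. The resolver substitutes an empty DTD for that one system ID so nothing is fetched over the network, and returns null in the default case.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ResolverRegistrationSketch {

    // Hypothetical system ID, used only for this illustration
    static final String DEMO_DTD = "http://example.invalid/demo.dtd";

    // Parses the document and returns every system ID handed to the resolver
    static List<String> resolvedSystemIds(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                                           .newSAXParser()
                                           .getXMLReader();
        List<String> seen = new ArrayList<>();

        reader.setEntityResolver((publicID, systemID) -> {
            seen.add(systemID);
            if (DEMO_DTD.equals(systemID)) {
                // Substitute an empty DTD instead of going to the network
                return new InputSource(new StringReader(""));
            }
            // In the default case, return null
            return null;
        });

        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }
}
```

Passing in a document such as <!DOCTYPE root SYSTEM "http://example.invalid/demo.dtd"><root/> should hand the DTD's system ID to the resolver before any content callbacks arrive.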
Additional content appearing in this section has been removed.
Filters and Writers
At this point, I want to diverge from the beaten path. So far, I've detailed everything that's in a "standard" SAX application, from the reader to the callbacks to the handlers. However, there are a lot of additional features in SAX that can really turn you into a power developer, and take you beyond the confines of "standard" SAX. In this section, I'll introduce you to two of these: SAX filters and writers. Using classes both in the standard SAX distribution and available separately from the SAX web site (http://www.megginson.com/SAX), you can add some fairly advanced behavior to your SAX applications. This will also get you in the mindset of using SAX as a pipeline of events, rather than a single layer of processing. I'll explain this concept in more detail, but suffice it to say that it really is the key to writing efficient and modular SAX code.
First on the list is an interface that comes in the basic SAX download from David Megginson's site, and should be included with any parser distribution supporting SAX 2.0. The interface in question here is org.xml.sax.XMLFilter. It extends the XMLReader interface, and adds two new methods to it:
public void setParent(XMLReader parent);

public XMLReader getParent( );
It might not seem like there is much to say here; what's the big deal, right? Well, by allowing a hierarchy of XMLReaders through this filtering mechanism, you can build up a processing chain, or pipeline, of events. To understand what I mean by a pipeline, here's the normal flow of a SAX parse:
  • Events in an XML document are passed to the SAX reader.
  • The SAX reader and registered handlers pass events and data to an application.
What developers started realizing, though, is that it is simple to insert one or more additional links into this chain:
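One such additional link can be sketched with org.xml.sax.helpers.XMLFilterImpl, the helper class in the standard SAX distribution that implements XMLFilter and passes every event through unchanged. The filter below is purely hypothetical (it uppercases element names on their way to the application), and I'm again assuming the JDK's bundled parser obtained through JAXP.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: sits between the reader and the application
class UpperCaseFilter extends XMLFilterImpl {

    UpperCaseFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // Change the event, then pass it down the pipeline
        super.startElement(uri, localName.toUpperCase(Locale.ROOT),
                           qName.toUpperCase(Locale.ROOT), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
        throws SAXException {
        super.endElement(uri, localName.toUpperCase(Locale.ROOT),
                         qName.toUpperCase(Locale.ROOT));
    }
}

class FilterPipelineSketch {

    // reader -> filter -> application handler
    static List<String> elementNames(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                                           .newSAXParser()
                                           .getXMLReader();
        XMLFilterImpl filter = new UpperCaseFilter(reader);

        List<String> names = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });

        // Parsing is started on the filter, which drives its parent reader
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```

The application registers its handler with the filter rather than the reader, and never needs to know how many links sit in between. Because a filter is itself an XMLReader, it can serve as the parent of another filter.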
Additional content appearing in this section has been removed.
Even More Handlers
Now I want to show you two more handler interfaces that SAX offers. Both of these interfaces are no longer part of the core SAX distribution, and are located in the org.xml.sax.ext package to indicate they are extensions to SAX. However, most parsers (such as Apache Xerces) include these two interfaces for use. Check your vendor documentation, and if you don't have them, you can download them from the SAX web site. I warn you that not all SAX drivers support these extensions, so if your vendor doesn't include them, you may want to find out why, and see if an upcoming version of the vendor's software will support the SAX extensions.
The first of these two handlers is the most useful: org.xml.sax.ext.LexicalHandler. This handler provides methods that can receive notification of several lexical events such as comments, the start and end of entities, the start and end of the DTD declaration, and CDATA sections. In ContentHandler, these lexical events are essentially ignored, and you just get the data and declarations without notification of when or how they were provided.
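As a standalone sketch, separate from the SAXTreeViewer example, the code below records a few of these lexical events. It makes several assumptions of its own: the class name is hypothetical, the parser is the one bundled with the JDK (obtained through JAXP), and the handler extends org.xml.sax.ext.DefaultHandler2, a convenience class from later SAX releases that supplies empty LexicalHandler methods. The handler is registered through the standard lexical-handler property.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class LexicalEventsSketch {

    // Standard SAX 2.0 property used to register a LexicalHandler
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    static List<String> lexicalEvents(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                                           .newSAXParser()
                                           .getXMLReader();
        List<String> events = new ArrayList<>();

        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                events.add("comment:" + new String(ch, start, length));
            }

            @Override
            public void startCDATA() {
                events.add("startCDATA");
            }

            @Override
            public void endCDATA() {
                events.add("endCDATA");
            }
        };

        reader.setContentHandler(handler);
        // Lexical handlers are set as a property, not through a setter
        reader.setProperty(LEXICAL_HANDLER, handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }
}
```

A ContentHandler alone would see only the character data inside the CDATA section. With the lexical handler registered, the comment and the CDATA boundaries are reported as well.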
This is not really a general-use handler, as most applications don't need to know if text was in a CDATA section or not. However, if you are working with an XML editor, serializer, or other component that must know the exact format of the input document, not just its contents, the LexicalHandler can really help you out. To see this guy in action, you first need to add an import statement for org.xml.sax.ext.LexicalHandler to your SAXTreeViewer.java source file. Once that's done, you can add LexicalHandler to the implements clause in the nonpublic class JTreeContentHandler in that source file:
class JTreeContentHandler implements ContentHandler, LexicalHandler {
    // Callback implementations
}
By reusing the content handler already in this class, our lexical callbacks can operate upon the JTree for visual display of these lexical callbacks. So now you need to add implementations for all the methods defined in
Additional content appearing in this section has been removed.
Gotcha!
As you get into the more advanced features of SAX, you certainly don't reduce the number of problems you can get yourself into. However, these problems often become more subtle, which makes for some tricky bugs to track down. I'll point out a few of these common problems.
As I mentioned in the section on EntityResolvers, you should always ensure that you return null as a starting point for resolveEntity( ) method implementations. Luckily, Java ensures that you return something from the method, but I've often seen code like this:
    public InputSource resolveEntity(String publicID, String systemID)
        throws IOException, SAXException {

        InputSource inputSource = new InputSource( );

        // Handle references to online version of copyright.xml   
        if (systemID.equals(
            "http://www.newInstance.com/javaxml2/copyright.xml")) {
            inputSource.setSystemId(
                "file:///c:/javaxml2/ch04/xml/copyright.xml");
        }            
        
        // In the default case, return null
        return inputSource;    
    }
As you can see, an InputSource is created initially and then the system ID is set on that source. The problem here is that if no if blocks are entered, an InputSource with no system or public ID, as well as no specified Reader or InputStream, is returned. This can lead to unpredictable results; in some parsers, things continue with no problems. In other parsers, though, returning an empty InputSource results in entities being ignored, or in exceptions being thrown. In other words, return null at the end of every resolveEntity( ) implementation, and you won't have to worry about these details.
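For comparison, here is one way the same resolver might be written so that it does return null by default. Treat it as a sketch: the class and constant names are hypothetical, while the two URIs are the ones from the listing above.

```java
import java.io.IOException;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CopyrightEntityResolver implements EntityResolver {

    static final String ONLINE_COPYRIGHT =
        "http://www.newInstance.com/javaxml2/copyright.xml";
    static final String LOCAL_COPYRIGHT =
        "file:///c:/javaxml2/ch04/xml/copyright.xml";

    public InputSource resolveEntity(String publicID, String systemID)
        throws IOException, SAXException {

        // Handle references to online version of copyright.xml
        if (ONLINE_COPYRIGHT.equals(systemID)) {
            // Create the InputSource only when there is something to return
            return new InputSource(LOCAL_COPYRIGHT);
        }

        // In the default case, return null
        return null;
    }
}
```

The InputSource is now created inside the if block, so there is no half-initialized object left over to return by accident. Comparing with the constant on the left also keeps the method safe if the parser passes a null system ID.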
I've described setting properties and features in this chapter, their effect on validation, and also the DTDHandler interface. In all that discussion of DTDs and validation, it's possible you got a few things mixed up; I want to be clear that the
Additional content appearing in this section has been removed.
What's Next?
That's plenty of information on the Simple API for XML. Although there is certainly more to dig into, the information in this chapter and the last should have you ready for almost anything you'll run into. Of course, SAX isn't the only API for working with XML; to be a true XML expert you'll need to master DOM, JDOM, JAXP, and more. I'll start you on the next API in this laundry list, the Document Object Model (DOM), in the next chapter.
To introduce DOM, I'll start with the basics, much as the last chapter gave you a solid start on SAX. You'll find out about tree APIs and how DOM is significantly different from SAX, and see the DOM core classes. I'll show you a sample application that serializes DOM trees, and soon you'll be writing your own DOM code.
Additional content appearing in this section has been removed.
Chapter 5: DOM
In the previous chapters, I've talked about Java and XML in the general sense, but I have described only SAX in depth. As you may be aware, SAX is just one of several APIs that allow XML work to be done within Java. This chapter and the next will widen your API knowledge as I introduce the Document Object Model, commonly called the DOM. This API is quite a bit different from SAX, and complements the Simple API for XML in many ways. You'll need both, as well as the other APIs and tools in the rest of this book, to be a competent XML developer.
Because DOM is fundamentally different from SAX, I'll spend a good bit of time discussing the concepts behind DOM, and why it might be used instead of SAX for certain applications. Selecting any XML API involves tradeoffs, and choosing between DOM and SAX is certainly no exception. I'll move on to possibly the most important topic: code. I'll introduce you to a utility class that serializes DOM trees, something that the DOM API itself doesn't currently supply. This will provide a pretty good look at the DOM structure and related classes, and get you ready for some more advanced DOM work. Finally, I'll show you some problem areas and important aspects of DOM in the "Gotcha!" section.
Additional content appearing in this section has been removed.
The Document Object Model
The Document Object Model, unlike SAX, has its origins in the World Wide Web Consortium (W3C). Whereas SAX is public-domain software, developed through long discussions on the XML-dev mailing list, DOM is a standard just like the actual XML specification. The DOM is not designed specifically for Java, but to represent the content and model of documents across all programming languages and tools. Bindings exist for JavaScript, Java, CORBA, and other languages, allowing the DOM to be a cross-platform and cross-language specification.
In addition to being different from SAX in regard to standardization and language bindings, the DOM is organized into "levels" instead of versions. DOM Level 1 is an accepted recommendation, and you can view the completed specification at http://www.w3.org/TR/REC-DOM-Level-1/. Level 1 details the functionality and navigation of content within a document. A document in the DOM is not limited to XML, but can be HTML or other content models as well! Level 2, which was finalized in November of 2000, builds upon Level 1 by supplying modules and options aimed at specific content models, such as XML, HTML, and Cascading Style Sheets (CSS). These less-generic modules begin to "fill in the blanks" left by the more general tools provided in DOM Level 1. You can view the current Level 2 Recommendation at http://www.w3.org/TR/DOM-Level-2/. Level 3 is already being worked on, and should add even more facilities for specific types of documents, such as validation handlers for XML, and other features that I'll discuss in Chapter 6.
Using the DOM for a specific programming language requires a set of interfaces and classes that define and implement the DOM itself. Because the DOM specification focuses on the model of a document rather than on the specific methods involved, language bindings must be developed to represent the conceptual structure of the DOM for its use in Java or any other language. These language bindings then serve as APIs for you to manipulate documents in the fashion outlined in the DOM specification.
Additional content appearing in this section has been removed.
Serialization
One of the most common questions about using DOM is, "I have a DOM tree; how do I write it out to a file?" This question is asked so often because DOM Levels 1 and 2 do not provide a standard means of serialization for DOM trees. While this is a bit of a shortcoming of the API, it provides a great example in using DOM (and as you'll see in the next chapter, DOM Level 3 seeks to correct this problem). In this section, to familiarize you with the DOM, I'm going to walk you through a class that takes a DOM tree as input, and serializes that tree to a supplied output.
Before I talk about outputting a DOM tree, I will give you information on getting a DOM tree in the first place. For the sake of example, all that the code in this chapter does is read in a file, create a DOM tree, and then write that DOM tree back out to another file. However, this still gives you a good start on DOM and prepares you for some more advanced topics in the next chapter.
As a result, there are two Java source files of interest in this chapter. The first is the serializer itself, which is called (not surprisingly) DOMSerializer.java. The second, which I'll start on now, is SerializerTest.java. This class takes in a filename for the XML document to read and a filename for the document to serialize out to. Additionally, it demonstrates how to take in a file, parse it, and obtain the resultant DOM tree object, represented by the org.w3c.dom.Document interface. Go ahead and download this class from the book's web site, or enter the code as shown in Example 5-1, for the SerializerTest class.
Example 5-1. The SerializerTest class
package javaxml2;

import java.io.File;
import org.w3c.dom.Document;

// Parser import
import org.apache.xerces.parsers.DOMParser;

public class SerializerTest {

    public void test(String xmlDocument, String outputFilename) 
        throws Exception {

        File outputFile = new File(outputFilename);
        DOMParser parser = new DOMParser( );

        // Get the DOM tree as a Document object

        // Serialize
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println(
                "Usage: java javaxml2.SerializerTest " +
                "[XML document to read] " +
                "[filename to write out to]");
            return;
        }

        try {
            SerializerTest tester = new SerializerTest( );
            tester.test(args[0], args[1]);
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}
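To give a feel for what "getting a DOM tree" and "writing it back out" involve, here is a rough, self-contained sketch. It is not the DOMSerializer class discussed in this chapter. As assumptions of my own, it obtains the Document through JAXP's DocumentBuilder instead of the Xerces DOMParser imported above, so it runs on a plain JDK. The class name is hypothetical, and it handles only document, element, attribute, and text nodes, with no escaping or indentation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWalkSketch {

    // Get the DOM tree as a Document object
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Recursively write a node and everything beneath it
    static void serialize(Node node, Writer out) throws IOException {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                serializeChildren(node, out);
                break;
            case Node.ELEMENT_NODE:
                out.write("<" + node.getNodeName());
                NamedNodeMap attributes = node.getAttributes();
                for (int i = 0; i < attributes.getLength(); i++) {
                    Node attr = attributes.item(i);
                    out.write(" " + attr.getNodeName() + "=\"" +
                              attr.getNodeValue() + "\"");
                }
                out.write(">");
                serializeChildren(node, out);
                out.write("</" + node.getNodeName() + ">");
                break;
            case Node.TEXT_NODE:
                out.write(node.getNodeValue());
                break;
            default:
                // Comments, processing instructions, etc. are skipped here
                break;
        }
    }

    private static void serializeChildren(Node node, Writer out)
        throws IOException {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            serialize(children.item(i), out);
        }
    }

    // Parse a document, then write the resulting tree back out
    static String roundTrip(String xml) throws Exception {
        StringWriter out = new StringWriter();
        serialize(parse(xml), out);
        return out.toString();
    }
}
```

The switch on getNodeType( ) is the heart of the idea: every node in the tree reports what kind of node it is, and a serializer decides how to write each kind before recursing into its children.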
Additional content appearing in this section has been removed.
Mutability
One glaring omission in this chapter is the topic of modifying a DOM tree. That's not an accident; working with DOM is a lot more complex than working with SAX. Rather than drowning you in information, I wanted to give a clear picture of the various node types and structures used in DOM. In the next chapter, in addition to looking at some of the finer points of DOM Levels 2 and 3, I'll address the mutability of DOM trees, and in particular how to create DOM trees. So don't panic—help is on the way!
Additional content appearing in this section has been removed.
Gotcha!