Java and XML
By Brett McLaughlin
January 1900
Pages: 495

Chapter 1: Introduction
XML. These three letters have brought shivers to almost every developer in the world at some point in the last two years. While those shivers were often fear at another acronym to memorize, excitement at the promise of a new technology, or annoyance at another source of confusion for today's developer, they were shivers all the same. Surprisingly, almost every type of response was well merited with regard to XML. It is another acronym to memorize, and in fact brings with it a dizzying array of companions: XSL, XSLT, PI, DTD, XHTML, and more. It also brings with it a huge promise: what Java did for portability of code, XML claims to do for portability of data. Sun has even been touting the rather ambitious slogan "Java + XML = Portable Code + Portable Data" in recent months. And yes, XML does bring with it a significant amount of confusion. We will seek to unravel and demystify XML, without being so abstract and general as to be useless, and without diving in so deeply that this becomes just another dull specification to wade through. This is a book for you, the Java developer, who wants to understand the hype and use the tools that XML brings to the table.
Today's web application now faces a wealth of problems that were not even considered ten years ago. Systems that are distributed across thousands of miles must perform quickly and flawlessly. Data from heterogeneous systems, databases, directory services, and applications must be transferred without a single decimal place being lost. Applications must be able to communicate not only with other business components, but other business systems altogether, often across companies as well as technologies. Clients are no longer limited to thick clients, but can be web browsers that support HTML, mobile phones that support the Wireless Application Protocol (WAP), or handheld organizers with entirely different markup languages. Data, and the transformation of that data, has become the crucial centerpiece of every application being developed today.
XML offers a way for programmers to meet all of these requirements. In addition, Java developers have an arsenal of APIs that enable them to use XML and its many companions without ever leaving a Java Integrated Development Environment (IDE). If this sounds a little too good to be true, keep reading. You will walk through the pitfalls of the various Java APIs as well as look at some of the bleeding-edge developments in the XML specification and the Java APIs for XML. Through it all, we will take a developer's view. This is not a book about why you should use XML, but rather how you should use it. If there are offerings in the specification that are not of much use, details of why will be clearly given and we will move on; if something is of great value, we'll spend some extra time on it. Throughout, we will focus on using XML as a tool, not using it as a buzzword or for the sake of having the latest toy. With that in mind, let's begin to talk about what XML is.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Is It?
XML is the Extensible Markup Language. Like its predecessor SGML, XML is a meta-language used to define other languages. However, XML is much simpler and more straightforward than SGML. XML is a markup language that specifies neither the tag set nor the grammar for that language. The tag set for a markup language defines the markup tags that have meaning to a language parser. For example, HTML has a strict set of tags that are allowed. You may use the tag <TABLE> but not the tag <CHAIR>. While the first tag has a specific meaning to an application using the data, and is used to signify the start of a table in HTML, the second tag has no specific meaning, and although most browsers will ignore it, unexpected things can happen when it appears. That is because when HTML was defined, the tag set of the language was defined with it. With each new version of HTML, new tags are defined. However, if a tag is not defined, it may not be used as part of the markup language without generating an error when the document is parsed. The grammar of a markup language defines the correct use of the language's tags. Again, let's use HTML as an example. When using the <TABLE> tag, several attributes may be included, such as the width, the background color, and the alignment. However, you cannot define the TYPE of the table because the grammar of HTML does not allow it.
XML, by defining neither the tags nor the grammar, is completely extensible; thus its name. If you choose to use the tag <TABLE> and then nest within that tag several <CHAIR> tags, you may do so. If you wish to define a TYPE attribute for the <CHAIR> tag, you may do that also. You could even use tags named after your children or co-workers if you so desired! To demonstrate, let's take a look at the XML file shown in Example 1.1.
Example 1.1. A Sample XML File
<?xml version="1.0"?>

<dining-room>
    <table type="round" wood="maple">
        <manufacturer>The Wood Shop</manufacturer>
        <price>$1999.99</price>
    </table>

    <chair wood="maple">
        <quantity>2</quantity>
        <quality>excellent</quality>
        <cushion included="true">
            <color>blue</color>
        </cushion>
    </chair>

    <chair wood="oak">
        <quantity>3</quantity>
        <quality>average</quality>
    </chair>
</dining-room>
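To make the point about free-form tags concrete, here is a minimal sketch, not taken from the book, that loads an abbreviated copy of Example 1.1 and reads the wood attribute of each chair element. It assumes the JAXP and DOM classes bundled with the JDK; the class name DiningRoomSketch and the method chairWoods are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DiningRoomSketch {

    // Abbreviated copy of Example 1.1 (table and cushion details omitted).
    static final String SAMPLE =
          "<?xml version=\"1.0\"?>"
        + "<dining-room>"
        + "<chair wood=\"maple\"><quantity>2</quantity></chair>"
        + "<chair wood=\"oak\"><quantity>3</quantity></chair>"
        + "</dining-room>";

    /** Returns the value of the wood attribute of every chair element, in document order. */
    static List<String> chairWoods(String xml) throws Exception {
        // Assumption: the JDK's default DOM parser; no DTD or tag set is supplied.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList chairs = doc.getElementsByTagName("chair");
        List<String> woods = new ArrayList<>();
        for (int i = 0; i < chairs.getLength(); i++) {
            Element chair = (Element) chairs.item(i);
            woods.add(chair.getAttribute("wood"));
        }
        return woods;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(chairWoods(SAMPLE));
    }
}
```

The parser is never told which tags or attributes are legal; it only requires that the document be well-formed. Giving those names a meaning is left entirely to the application.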
How Do I Use It?
All of the great ideas XML has brought to us are not much use without some tools to use these ideas within our familiar programming environments. Luckily, XML has been paired with Java since its inception, and Java boasts the most complete set of APIs available to allow use of XML directly within Java code. While C, C++, and Perl are quickly catching up, Java continues to set the standard on how to use XML from applications. There are two basic stages that occur in an XML document's lifecycle from an application point of view, as shown in Figure 1.1. First, the document is parsed, and then the data within it is manipulated.
Figure 1.1: The application view of an XML document lifecycle
As Java developers, we are fortunate to have simple ways to handle these tasks and more.
SAX is the Simple API for XML. It provides an event-based framework for parsing XML data, which is the process of reading through the document and breaking down the data into usable parts; at each step of the way, SAX defines events that can occur. For example, SAX defines an org.xml.sax.ContentHandler interface that defines methods such as startDocument( ) and endElement( ). Implementing this interface allows complete control over these portions of the XML parsing process. There is a similar interface for handling errors and lexical constructs. A set of errors and warnings is defined, allowing handling of the various situations that can occur in XML parsing, such as an invalid document, or one that is not well-formed. Behavior can be added to customize the parsing process, resulting in very application-specific tasks being available for definition, all with a standard interface into XML documents. For the SAX API documentation and other information on SAX, visit http://www.megginson.com/SAX.
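The event-based model described above can be sketched roughly as follows. This is an illustration rather than the book's own code: it assumes the JAXP SAXParserFactory bundled with the JDK for obtaining a parser, and the names SaxEventSketch, EventRecorder, and trace are hypothetical. The handler extends org.xml.sax.helpers.DefaultHandler, which supplies default implementations of the ContentHandler callbacks, and overrides only four of them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventSketch {

    /** Hypothetical handler that records each callback it receives, in order. */
    static class EventRecorder extends DefaultHandler {
        final StringBuilder events = new StringBuilder();

        @Override
        public void startDocument() {
            events.append("startDocument;");
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            events.append("start:").append(qName).append(';');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.append("end:").append(qName).append(';');
        }

        @Override
        public void endDocument() {
            events.append("endDocument;");
        }
    }

    /** Parses the XML text and returns the sequence of recorded events. */
    static String trace(String xml) throws Exception {
        // Assumption: whichever SAX parser JAXP locates (the JDK ships one).
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser()
                .getXMLReader();
        EventRecorder recorder = new EventRecorder();
        reader.setContentHandler(recorder);
        reader.parse(new InputSource(new StringReader(xml)));
        return recorder.events.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(trace("<chair><quantity>2</quantity></chair>"));
    }
}
```

The recorded trace follows document order, with each start event paired with its matching end event, because the parser invokes the callbacks as it reads through the input.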
Before continuing, it is important to clear up a common misconception about SAX. SAX is often mistaken for an XML parser. We even discuss SAX here as providing a means to parse XML data. However, SAX provides a
Why Should I Use It?
So now you've managed to sort through the alphabet soup of XML-related technologies. You even have realized that there may be more to XML than just another way to build a presentation layer. But you aren't quite sure where XML fits in with the applications you are building at work. You aren't positive that you could convince your boss to let you spend time learning more about XML, because you don't know how it could help make a better application. You even are thinking about trying to evaluate some tools to use XML, but you aren't sure where to start.
If this is the situation you find yourself in, excited about a new technology but confused as to where to go next, then read on! In this section, we begin to cast XML in the light of real-world applications, and give you a reason to use XML in your applications today. We will first look at how XML is being used today in applications, and we'll give you the information to convince that boss of yours that "everybody's doing it." Next we will take a look at support for XML and related technologies, all in light of Java applications. In Java, there is a wealth of available parsers, transformers, publishing engines, and frameworks designed specifically for XML. Finally, we will spend some time looking at where XML is going and try to anticipate how it will affect applications six months and a year from now. This is the information to use to convince your boss's boss that XML can not only keep you even with your competitors, but give your company the leading edge in your industry, and help get you that next promotion!
Even if you have been convinced that XML is a great technology, and that it is taking the world by storm, we have yet to mention why this book is about Java and XML, rather than just XML alone. Java is, in fact, the ideal counterpart for XML, and the reason can be summed up in a single phrase: Java is portable code, and XML is portable data. Taken separately, both technologies are wonderful, but have limitations. Java requires the developer to dream up formats for network data and formats for presentation, and to use technologies like JavaServer Pages™ (JSP) that do not provide a real separation of content and presentation layers. XML is simply metadata, and without programs like parsers and XSL processors, is essentially "vapor-ware." However, Java and XML matched together fill in the gaps in the application development picture.
What's Next?
With our whirlwind tour of XML technologies and the Java APIs to manipulate them complete, we are ready to dive into more detail. We will spend the next two chapters detailing XML syntax and how XML can be used in web applications. This will give us the understanding of XML data that we need in order to create, format, parse, and manipulate it within our applications. In the next chapter, creating an XML document will be detailed, and further definition will be given of what it means for an XML document to be well-formed.
One last important note before we begin; if you skimmed the rest of the chapter, please take a moment and read this paragraph carefully. XML has been surrounded with confusion and misinformation since its inception. This book proceeds with the assumption that you are taking XML at face value, and not carrying any of those assumptions around with you, particularly ones about XML being designed for presentation. In other words, we are going to focus on XML as data. We will not refer to XML documents as data that is about to be presented, or information we can transform, but rather as simple data. This important concept may surprise you a bit, as most people still think of presentation when they think of XML. However, as Java developers, we need to treat XML as data and nothing more. We will spend the larger portion of this book not formatting XML, but merely parsing and manipulating it. The power of XML is transmitting data from system to system, application to application, and business to business. Trying to remove any preconceptions about what XML can do for you can help make this book more enjoyable, as well as show you a few ways to use XML you may not have considered.
Chapter 2: Creating XML
Now that you have a greater understanding of XML, how it can be used, and some of the Java APIs available, it's time to turn concepts into practice. Although this book is not by any means a definitive guide to XML syntax, or even an XML reference, it would be impossible to discuss how to parse and manipulate XML documents without first being able to create those documents. In addition, the Java APIs for handling XML all assume a fair amount of familiarity with XML syntax and structure, as well as with the design patterns that go into creating an XML document, constraining it, and transforming it. Therefore we look at each of these tasks before discussing the corresponding Java APIs.
To begin, we will take a closer look at XML syntax in this chapter. Starting with the very basic XML constructs, we will discuss what a well-formed XML document is and how to create one. The various XML rules and syntactical "gotchas" will be covered to help you build XML documents that are not only legal, but can be used in realistic applications. All this work will set the stage for writing our first Java program in the next chapter to understand how parsing XML works, and how Java provides callbacks into the parsing process.
If you have ever read a chapter or even a book on a programming language's syntax, you probably realize it is usually pretty dry reading. To try and avoid this, we will look at syntax in a bit of a different light than you may be used to. Rather than starting with a simple one- or two-line XML file and adding to it, which typically makes for a lengthy, useless file at the end of the exercise, we will look at a complete, usable, relatively complex XML file. The file we will use is a portion of the actual XML document that represents the table of contents page for this book. We will walk through this document line by line, examining the different constructs. What a lot of syntactical discussions ignore is that in the real world, you almost never get to see the simple files that are so often used as examples; instead, you see complex files that don't make any sense to you, even after reading a book. You should get used to seeing an XML file with all its constructs, and begin to learn its structure through practical examples. Hopefully this makes the discussion at least a little more applicable for you, if not somewhat less dry.
An XML Document
As promised, we begin with a practical, real-world example of an XML document that represents a portion of this book's table of contents, shown in Example 2.1.
Example 2.1. An XML File
<?xml version="1.0"?>
<?xml-stylesheet href="XSL\JavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSL\JavaXML.wml.xsl" type="text/xsl" 
                 media="wap"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE JavaXML:Book SYSTEM "DTD\JavaXML.dtd">

<!-- Java and XML -->
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
 <JavaXML:Title>Java and XML</JavaXML:Title>
 <JavaXML:Contents>

  <JavaXML:Chapter focus="XML">
   <JavaXML:Heading>Introduction</JavaXML:Heading>
   <JavaXML:Topic subSections="7">What Is It?</JavaXML:Topic>
   <JavaXML:Topic subSections="3">How Do I Use It?</JavaXML:Topic>
   <JavaXML:Topic subSections="4">Why Should I Use It?</JavaXML:Topic>
   <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

  <JavaXML:Chapter focus="XML">
   <JavaXML:Heading>Creating XML</JavaXML:Heading>
   <JavaXML:Topic subSections="0">An XML Document</JavaXML:Topic>
   <JavaXML:Topic subSections="2">The Header</JavaXML:Topic>
   <JavaXML:Topic subSections="6">The Content</JavaXML:Topic>
   <JavaXML:Topic subSections="1">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

  <JavaXML:Chapter focus="Java">
   <JavaXML:Heading>Parsing XML</JavaXML:Heading>
   <JavaXML:Topic subSections="3">Getting Prepared</JavaXML:Topic>
   <JavaXML:Topic subSections="3">SAX Readers</JavaXML:Topic>
   <JavaXML:Topic subSections="9">Content Handlers</JavaXML:Topic>
   <JavaXML:Topic subSections="4">Error Handlers</JavaXML:Topic>
   <JavaXML:Topic subSections="0">
     A Better Way to Load a Parser
   </JavaXML:Topic>
   <JavaXML:Topic subSections="4">"Gotcha!"</JavaXML:Topic>
   <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

  <JavaXML:SectionBreak/>

  <JavaXML:Chapter focus="Java">
   <JavaXML:Heading>Web Publishing Frameworks</JavaXML:Heading>
   <JavaXML:Topic subSections="4">Selecting a Framework</JavaXML:Topic>
   <JavaXML:Topic subSections="4">Installation</JavaXML:Topic>
   <JavaXML:Topic subSections="3">
     Using a Publishing Framework
   </JavaXML:Topic>
   <JavaXML:Topic subSections="2">XSP</JavaXML:Topic>
   <JavaXML:Topic subSections="3">Cocoon 2.0 and Beyond</JavaXML:Topic>
   <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

 </JavaXML:Contents>

 <JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>

</JavaXML:Book>
The Header
The first syntax we look at is XML itself. An XML document can be broken into two basic pieces: the header, which gives an XML parser and XML applications information about how to handle the document, and the content, which is the XML data itself. Although this is a fairly loose division, it will help us differentiate the instructions to applications within an XML document from the XML content itself, and is an important distinction to understand. In our example, we will begin with the first several lines, which lead up to the JavaXML:Book element. These initial lines, excluding the JavaXML:Book element, make up the document header. The term "header" is not a formal term defined in the XML specification, but is commonly used in the XML community, and we will use it in this book to denote these initial lines of an XML document.
The first statement you will see in any XML document is an XML instruction. XML instructions are actually a specific subset of processing instructions (PIs), which we talked about in the last chapter. Remember that we said PIs are generally passed on from the parser to the calling application, and handled there. However, PIs that specify their target as xml are intended for the XML parser itself. They specify the version of XML being used, a stylesheet, or other information that a parser may need to know to properly parse XML data. Here is an XML instruction:
<?xml version="1.0" standalone="no"?>
Like any other PI, it is of the form <?target instruction?>, and in this case it specifies that XML Version 1.0 is being used and that the document is not a standalone XML document. Notice that the instruction is not necessarily a single keyword=value pair; in this case, both the version and whether the document needs to be paired with an external document or documents are specified. By specifying that it is not a standalone document, a parser knows that an external DTD must be used to determine if the XML document is valid. If this were set to
The Content
With our header worked out, we now can move on to the actual data content in our XML document. This consists of all the elements, attributes, and textual data within these constructs.
The root element is the highest-level element in the XML document, and must be the first opening tag and the last closing tag within the document. It provides a reference point that enables an XML parser or XML-aware application to recognize a beginning and end to an XML document. In our example, the root element is <JavaXML:Book>:
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/" >

  <!-- Content of XML Document -->

</JavaXML:Book>
This tag and its matching closing tag surround all other data content within the XML document. XML specifies that there may only be one root element in a document. In other words, the root element must enclose all other elements within the document. Aside from this requirement, a root element does not differ from any other XML element. It's important to understand this, because XML documents can reference and include other XML documents. In these cases, the root element of the referenced document becomes an enclosed element in the referring document, and must be handled normally by an XML parser. Defining root elements as standard XML elements without special properties or behavior allows document inclusion to work seamlessly.
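The single-root rule is easy to observe from Java. The following sketch, which is an illustration and not part of the book's example set, assumes the JAXP SAX parser bundled with the JDK and a hypothetical helper named isWellFormed. It relies on the fact that DefaultHandler rethrows fatal errors, so a document with two top-level elements surfaces as a SAXParseException.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RootElementCheck {

    /** Returns true if the text parses as well-formed XML, false if the parser rejects it. */
    static boolean isWellFormed(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)),
                           new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // Well-formedness violations are reported as fatal errors,
            // which DefaultHandler rethrows to the caller.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String oneRoot  = "<book><chapter/><chapter/></book>";
        String twoRoots = "<chapter/><chapter/>";
        System.out.println("one root:  " + isWellFormed(oneRoot));
        System.out.println("two roots: " + isWellFormed(twoRoots));
    }
}
```

Wrapping the two chapter elements in a single enclosing element is all it takes to turn the rejected input into an accepted one, which matches the statement that the root element is otherwise an ordinary element.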
Although we will not delve deeply into XML namespaces here, you should note the use of a namespace in the root element. You probably observed that all of the XML elements' names are prefixed with JavaXML. In our XML example, it may be necessary later to include portions of other O'Reilly books. Because each of these books may also have <Chapter>, <Heading>, or <Topic> tags, the document must be designed and constructed in a way to avoid namespace collision problems with other documents. The XML namespaces specification nicely solves this problem. Because our XML document represents a specific book, and no other XML document should represent the same book, using a prefix like
What's Next?
With this primer on creating XML documents, we are ready to begin writing our first Java code. In the next chapter, we will take a look at using the Simple API for XML (SAX). Starting with a simple program to parse through our XML document, we will learn how PIs, elements, attributes, and other XML constructs are handled within the XML parsing process. Along with each step, we will provide Java code to perform specific actions, beginning with a simple program to print out our XML document. This will start the extensive process of learning how to manipulate all of the various components of an XML document, and how to use this information within Java applications.
Chapter 3: Parsing XML
With two solid chapters of introduction behind us, we are ready to code! By now you have seen the numerous acronyms that make up the world of XML, you have delved into the language itself, and you should be familiar with an XML document. This chapter takes the next step, and the first on our path of Java programming, by demonstrating how an XML document is parsed and how we can access the parsed data from within Java code.
One of the first things you will have to do when dealing with XML programmatically is take an XML document and parse it. As the document is parsed, the data in the document becomes available to the application using the parser, and suddenly we are within an XML-aware application! If this all sounds a little too simple to be true, it almost is. In this chapter, we will look closely at how an XML document is parsed. We will cover how to use a parser within an application and how to feed that parser your document's data. Then we will look at the various callbacks that are available within the parsing lifecycle. These events are the points where application-specific code can be inserted and data manipulation can occur.
In addition to looking at how parsers work, we will also begin our exploration of the Simple API for XML (SAX) in this chapter. SAX is what makes these parsing callbacks available. The interfaces provided in the SAX package will become an important part of our toolkit for handling XML. Even though the SAX classes are small and few in number, everything else in our discussions of XML is based on these classes. A solid understanding of how they help us access XML data is critical to effectively leveraging XML in your Java programs.
Getting Prepared
There are several items that we should take care of before beginning to code. First, you must obtain an XML parser. Writing a parser for XML is a serious task, and there are several efforts going on to provide excellent XML parsers. We are not going to detail the process of actually writing an XML parser here; rather, we will discuss the applications that wrap this parsing behavior, focusing on using existing tools to manipulate XML data. This results in better and faster programs, as we do not seek to reinvent what is already available. After selecting a parser, we must ensure that a copy of the SAX classes is on hand. These are easy to locate, and are key to our Java code being able to process XML. Finally, we will need an XML document to parse. Then, on to the code!
The first step in getting ready to code Java that uses XML is locating and obtaining the parser you want to use. We briefly talked about this process in Chapter 1, and listed various XML parsers that could be used. To ensure that your parser works with all of the examples in the book, you should verify your parser's compliance with the XML specification. Because of the variety of parsers available and the rapid pace of change within the XML community, all of the details about which parsers have what compliance levels are beyond the scope of this book. You should consult the parser's vendor and visit the web sites previously given for this information.
In the spirit of the open source community, all of the examples in this book will use the Apache Xerces parser. Freely available in binary and source form at http://xml.apache.org, this C- and Java-based parser is already one of the most widely contributed-to parsers available. In addition, using an open source parser such as Xerces allows you to send questions or bug reports to the parser's authors, resulting in a better product, as well as helping you use the software quickly and correctly. To subscribe to the general list and request help on the Xerces parser, send a blank email to
SAX Readers
Without spending any further time on the preliminaries, let's begin to code. Our first program will be able to take an XML file as a command-line parameter, and parse that file. We will build document callbacks into the parsing process so that we can display events in the parsing process as they occur, which will give us a better idea of what exactly is going on "under the hood."
The first thing we need to do is get an instance of a class that conforms to the SAX org.xml.sax.XMLReader interface. This interface defines parsing behavior and allows us to set features and properties, which we will look at in Chapter 5. For those of you familiar with SAX 1.0, this interface replaces the org.xml.sax.Parser interface.
SAX provides an interface that all SAX-compliant XML parsers should implement. This allows SAX to know exactly what methods are available for callback and use within an application. For example, the Xerces main SAX parser class, org.apache.xerces.parsers.SAXParser, implements the org.xml.sax.XMLReader interface. If you have access to the source of your parser, you should see the same interface implemented in your parser's main SAX parser class. Each XML parser must have one class (sometimes more!) that implements this interface, and that is the class we need to instantiate to allow us to parse XML:
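import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.XMLReader;

// Instantiate the vendor's XMLReader implementation (here, Xerces)
XMLReader parser = new SAXParser();

// Parse the document identified by uri
parser.parse(uri);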
For those of you new to SAX entirely, it may be a bit confusing not to see the instance variable we used named reader or XMLReader. While that would be a normal convention, the SAX 1.0 classes defined the main parsing interface as Parser, and a lot of legacy code has variables named parser because of that naming. This interface was deprecated because of the large number of changes required for namespace and feature and properties support, but the naming convention is still a good one, as parser does indicate the purpose of the instance variable.
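For readers who want something that compiles on its own, here is a minimal sketch of the program described at the start of this section: it takes a document URI as a command-line parameter and parses it. It is an assumption-laden variant, not the book's listing: rather than naming the Xerces class, it asks the JAXP SAXParserFactory bundled with the JDK for an XMLReader, and the class and method names are hypothetical.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class ParseFromCommandLine {

    /** Obtains an XMLReader from whichever parser JAXP locates (assumption: not necessarily Xerces). */
    static XMLReader newReader()
            throws ParserConfigurationException, SAXException {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Parses the document at the given URI; returns normally only if no fatal error occurs. */
    static void parse(String uri)
            throws IOException, SAXException, ParserConfigurationException {
        // Variable named "parser" to match the convention discussed in the text.
        XMLReader parser = newReader();
        parser.parse(uri);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java ParseFromCommandLine <XML document URI>");
            return;
        }
        parse(args[0]);
        System.out.println("Parsed " + args[0]);
    }
}
```

Because no handlers are registered yet, a successful run reports nothing about the document's content; it only confirms that the parser read the input without a fatal error.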
Content Handlers
In order to let our application do something useful with XML data as it is being parsed, we must register handlers with the SAX parser. A handler is nothing more than a set of callbacks that SAX defines to let us interject application code at important events within a document's parsing. Realize that these events will take place as the document is parsed, not after the parsing has occurred. This is one of the reasons that SAX is such a powerful interface: it allows a document to be handled sequentially, without having to first read the entire document into memory. We will later look at the Document Object Model (DOM), which has this limitation.
There are four core handler interfaces defined by SAX 2.0: org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, org.xml.sax.DTDHandler, and org.xml.sax.EntityResolver. In this chapter, we discuss ContentHandler, which allows standard data-related events within an XML document to be handled, and take a first look at ErrorHandler, which receives notifications from the parser when errors in the XML data are found. DTDHandler will be examined in Chapter 5. We briefly discuss EntityResolver at various points in the text; it is enough for now to understand that EntityResolver works just like the other handlers, and is built specifically for resolving external entities specified within an XML document. Custom application classes that perform specific actions within the parsing process can implement each of these interfaces. These implementation classes can be registered with the parser with the methods setContentHandler( ), setErrorHandler( ), setDTDHandler( ), and setEntityResolver( ). Then the parser invokes the callback methods on the appropriate handlers during parsing.
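Registration of the four handlers might look like the sketch below. It is illustrative only and makes two assumptions beyond the text: the reader comes from the JDK's JAXP SAXParserFactory, and a single object plays all four roles by extending org.xml.sax.helpers.DefaultHandler, which implements ContentHandler, ErrorHandler, DTDHandler, and EntityResolver with default behavior. The names HandlerRegistrationSketch, ElementCounter, register, and countElements are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerRegistrationSketch {

    /** Hypothetical handler: counts elements and inherits defaults for every other callback. */
    static class ElementCounter extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }
    }

    /** Creates a reader and registers the same object for all four handler roles. */
    static XMLReader register(DefaultHandler handler) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser()
                .getXMLReader();
        reader.setContentHandler(handler);   // data-related events
        reader.setErrorHandler(handler);     // warnings and errors
        reader.setDTDHandler(handler);       // notations and unparsed entities
        reader.setEntityResolver(handler);   // external entity lookup
        return reader;
    }

    /** Parses the XML text and returns how many elements the parser reported. */
    static int countElements(String xml) throws Exception {
        ElementCounter counter = new ElementCounter();
        register(counter).parse(new InputSource(new StringReader(xml)));
        return counter.elements;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><b/></a>"));
    }
}
```

Nothing requires one class to fill every role; separate classes per interface work equally well, and each setter can be called independently.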
For our example, we want to implement the ContentHandler interface. This interface defines several important methods within the parsing lifecycle that our application can react to. First we need to add the appropriate import statements to our source file (including the
Error Handlers
In addition to providing the ContentHandler interface for handling parsing events, SAX provides an ErrorHandler interface that can be implemented to treat various error conditions that may arise during parsing. This interface works in the same manner as the content handler we have already constructed, but defines only three callback methods: warning( ), error( ), and fatalError( ). Through these three methods, all possible error conditions are handled and reported by SAX parsers.
Each method receives information about the error or warning that has occurred through a SAXParseException . This object holds the line number that trouble was encountered on, the URI of the document being treated, which could be the parsed document or an external reference within that document, and normal exception details such as a message and a printable stack trace. In addition, each method can throw a SAXException. This may seem a bit odd at first; an exception handler that throws an exception? Keep in mind that what each handler receives is a parsing exception. This can be a warning that should not cause the parsing process to stop or an error that needs to be resolved for parsing to continue; however, the callback may need to perform system I/O or another operation that can throw an exception, and it needs to be able to bubble this exception up the application chain. It can do this through the SAXException the method is allowed to throw.
For example, consider an error handler that receives error notifications and writes those errors to an error log. This method needs to be able to either append to or create an error log on the local filesystem. If a warning were to occur within the process of parsing an XML document, the warning would be reported to this method. The intent of the warning would be to give information to the callback and then continue parsing the document. However, if the error handler could not write to the log file, it might need to notify the parser and application that all parsing should stop. This can be done by catching any I/O exceptions and re-throwing these to the calling application, thus causing any further document parsing to stop. This common scenario is why error handlers must be able to throw exceptions (see Example 3.3).
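The logging scenario just described might be sketched as follows. This is not the book's Example 3.3; the class name LoggingErrorHandler, the log line format, and the two helper methods at the bottom (which exist only to exercise the handler) are assumptions. The sketch writes to any java.io.Writer, so a FileWriter opened in append mode could stand in for the log file.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class; a sketch, not the book's Example 3.3.
class LoggingErrorHandler implements ErrorHandler {
    private final Writer log;

    LoggingErrorHandler(Writer log) {
        this.log = log;
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        // Parsing continues after a warning, provided the log write succeeds
        write("WARNING", e);
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        write("ERROR", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        write("FATAL", e);
        throw e;  // design choice of this sketch: stop on fatal errors
    }

    private void write(String level, SAXParseException e) throws SAXException {
        try {
            log.write(level + " at line " + e.getLineNumber()
                + " of " + e.getSystemId() + ": " + e.getMessage() + "\n");
            log.flush();
        } catch (IOException io) {
            // The log is unusable: wrap the I/O failure and let it bubble up
            throw new SAXException("Unable to write to error log", io);
        }
    }

    /** Returns true if reporting a warning makes the handler throw. */
    static boolean warningStopsParsing(Writer log) {
        try {
            new LoggingErrorHandler(log).warning(
                new SAXParseException("sample warning", null, "sample.xml", 3, 1));
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    /** A writer that always fails, standing in for an unwritable log file. */
    static Writer brokenLog() {
        return new Writer() {
            @Override
            public void write(char[] buf, int off, int len) throws IOException {
                throw new IOException("disk full");
            }
            @Override public void flush() { }
            @Override public void close() { }
        };
    }
}
```

With a working writer, a warning is logged and parsing can carry on; with the failing writer, the same warning surfaces as a SAXException, which is what lets the calling application halt the parse.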
A Better Way to Load a Parser
Although we now have a successful demonstration of SAX parsing, there is a glaring problem with our code. Let's take a look again at how we obtain an instance of XMLReader:
try {
    // Instantiate a parser
    XMLReader parser = new SAXParser(  );

    // Register the content handler
    parser.setContentHandler(contentHandler);

    // Register the error handler
    parser.setErrorHandler(errorHandler);

    // Parse the document
    parser.parse(uri);

} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage(  ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage(  ));
}
Do you see anything that rubs you the wrong way? Let's look at another line of our code that may give you a hint:
// Import your vendor's XMLReader implementation here
import org.apache.xerces.parsers.SAXParser;
We have to explicitly import our vendor's XMLReader implementation, and then instantiate that implementation directly. The problem here is not the difficulty of this task, but that we have broken one of Java's biggest tenets: portability. Our code cannot run or even be compiled on a platform that does not use the Apache Xerces parser. In fact, it is conceivable that an updated version of Xerces might even change the name of the class used here! Our "portable" Java code is no longer very portable.
A better approach is to request an instance by the name of the implementation class, so that switching parsers means changing only a simple String parameter in your source code. Luckily, this facility is available in SAX 2.0. The org.xml.sax.helpers.XMLReaderFactory class provides the method you should be looking for:
/**
 * Attempt to create an XML reader from a class name.
 *
 * <p>Given a class name, this method attempts to load
 * and instantiate the class as an XML reader.</p>
 *
 * @return A new XML reader.
 * @exception org.xml.sax.SAXException If the class cannot be
 *            loaded, instantiated, and cast to XMLReader.
 * @see #createXMLReader(  )
 */
public static XMLReader createXMLReader (String className)
    throws SAXException {
    
    // Implementation
}
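As a rough illustration of why loading by name helps, the sketch below moves the class name out of the code entirely and reads it from a system property. The ReaderLoader class, the property name demo.sax.parser, and the canLoad( ) helper are hypothetical names invented for this sketch; only XMLReaderFactory.createXMLReader(String) comes from SAX itself. Note that recent JDK releases mark XMLReaderFactory as deprecated, so on a current JDK this compiles with a deprecation warning.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class, for illustration only.
class ReaderLoader {
    // Assumed property name; any configuration source would do
    static final String PARSER_PROPERTY = "demo.sax.parser";

    /**
     * Creates an XMLReader from a class name held in configuration,
     * falling back to the supplied default when the property is unset.
     */
    static XMLReader load(String defaultClassName) throws SAXException {
        String className =
            System.getProperty(PARSER_PROPERTY, defaultClassName);
        return XMLReaderFactory.createXMLReader(className);
    }

    /** Reports whether the named class can be loaded as an XMLReader. */
    static boolean canLoad(String className) {
        try {
            XMLReaderFactory.createXMLReader(className);
            return true;
        } catch (SAXException e) {
            // Thrown when the class cannot be loaded, instantiated, or cast
            return false;
        }
    }
}
```

No vendor class is imported here, so the source compiles on any platform with SAX 2.0 available; a missing or misnamed parser class shows up at run time as a SAXException rather than as a compile failure.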
"Gotcha!"
Before leaving our introduction to parsing XML documents, there are a few pitfalls to make you aware of. These "gotchas" will help you avoid common programming mistakes when using SAX, and we will discuss more of these for other APIs in the appropriate sections.
For those of you who are unlucky enough not to have a parser with SAX 2.0 support, don't despair. First, you always have the option of changing parsers; keeping current on SAX standards is an important part of an XML parser's responsibility, and if your vendor is not doing this, you may have other concerns to address with them as well. However, there are certainly cases where you are forced to use a parser because of legacy code or applications; in these situations, you are still not "left out in the cold."
SAX 2.0 includes a helper class, org.xml.sax.helpers.ParserAdapter, which can make a SAX 1.0 Parser implementation behave like a SAX 2.0 XMLReader implementation. This handy class takes a 1.0 Parser implementation as an input parameter and can then be used in place of that implementation. It allows a ContentHandler to be set, and handles all namespace callbacks properly. The only feature loss you will see is that skipped entities will not be reported, as this capability was not available in a 1.0 implementation in any form, and cannot be emulated by a 2.0 adapter class. The adapter class would be used as shown in Example 3.6.
Example 3.6. Using a SAX 1.0 Parser as a 2.0 XMLReader
try {
    // Register a parser with SAX
    Parser parser = 
        ParserFactory.makeParser(
            "org.apache.xerces.parsers.SAXParser");
            
    ParserAdapter myParser = new ParserAdapter(parser);
                                        
    // Register the document handler
    myParser.setContentHandler(contentHandler);
    
    // Register the error handler
    myParser.setErrorHandler(errHandler);            
        
    // Parse the document      
    myParser.parse(uri);
    
} catch (ClassNotFoundException e) {
    System.out.println(
        "The parser class could not be found.");
} catch (IllegalAccessException e) {
    System.out.println(
        "Insufficient privileges to load the parser class.");
} catch (InstantiationException e) {
    System.out.println(
        "The parser class could not be instantiated.");
} catch (ClassCastException e) {
    System.out.println(
        "The parser does not implement org.xml.sax.Parser");
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage(  ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage(  ));
}
What's Next?
You should now have a solid understanding of the SAX interfaces and how they interact with an XML parser and the parsing process, with regard to a non-validated XML document. These interfaces are key to the rest of our discussions and Java code, as we will expand on our knowledge of SAX and add additional SAX classes to our example program. In the next chapter, we will look at how an XML document can be validated, and cover an XML document's DTD and schema. These will teach you how to constrain an XML document, and then in the chapter after that, we will look at implementing validation in our example parsing code.
Chapter 4: Constraining XML
Learning to use XML, both for data representation and within Java applications, is an iterative process. In fact, almost every time you learn something about XML or one of its sister technologies, you will find that it gives you tools to learn yet another subset of the XML picture. Because there are so many XML-related projects and specifications, you will be hard-pressed to "know all there is to know" about XML; and just when you think you do, new versions of things you had down will come out, and you will get to start all over again! However, the more you do understand about the various components that make up the XML technology space, the better equipped you will be to add additional components to your programming toolkit. In keeping with this idea, we will now drop out of the Java programming language and return to XML-related specifications.
Chapter 2 and Chapter 3 should have given you the information and skills to create a well-formed XML document and then manipulate that document to a limited degree within Java. You also should begin to have a basic idea of how XML documents are parsed, and how the SAX Java classes aid in this process. In this chapter, we will discuss constraining the XML documents we have been creating. We will look at how Java can use these constraints in the parsing process in the next chapter.
Why Constrain XML Data?
Before assuming that you want to know about DTDs and XML Schema, it is only fair to help you understand why we should spend time on these specifications. There are some XML users and technologists who argue that there is never a need for constraining XML and ensuring document validity. Remember, we have already said that an XML document that is valid meets all the constraints that are set upon the document in the referenced DTD or schema. Also recall that a document can be well-formed, but still not be valid. So why go to the trouble to create a DTD or schema that does nothing but impose additional rules on your XML data?
As a Java developer, you have hopefully had lots of experience commenting your code, both with Javadoc and inline comments. At some point in your career, you were probably lectured on the importance of these comments; someone may have to read your code, someone may have to maintain your code, someone may actually have to understand your code. If you are involved in open source projects, the importance of commenting rises to even higher levels. And at some point, you probably rushed a project to completion to meet tight deadlines, and weren't exactly verbose in your comments. Then about three months later, another developer left with the task of supporting your project came to you and asked what this block of code did, or how that task was accomplished. Hopefully, you rattled off the correct explanation, but more likely you looked at him blankly and couldn't remember how you managed that particular feat of coding wizardry. At that point, you learned the value of documentation.
Now XML data is certainly not code, and simply because of the element nesting and other syntactical rules, it is almost always easier to understand than a snippet of complex Java code. However, don't be so sure that your outlook on data representation is the same outlook that other content authors may have. The simple XML file in Example 4.1 is an excellent example.
Document Type Definitions
As we have just discussed, an XML document is not very usable without an accompanying DTD. Just as XML can effectively describe data, the DTD makes this data usable in a variety of ways by many different programs by defining the structure of the data. In this section, we will look at the constructs for a DTD. We will again use as an example the XML representation of a portion of the table of contents for this book, and we will go through the process of constructing a DTD for the XML table of contents document.
The DTD's job is to define how data must be formatted. It must define each allowed element in an XML document, the allowed attributes, and possibly the acceptable attribute values for each element, the nesting and occurrences of each element, and any external entities. In fact, DTDs can specify quite a few other things about an XML document, but these basics are what we will focus on. We will learn the constructs that a DTD offers by applying them to and constraining our example XML file from Chapter 2. Because we will be referring to that file often throughout this chapter, it is reprinted here in Example 4.3.
Example 4.3. Table of Contents XML File
<?xml version="1.0"?>
<?xml-stylesheet href="XSL\JavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSL\JavaXML.wml.xsl" type="text/xsl" 
                 media="wap"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE JavaXML:Book SYSTEM "DTD\JavaXML.dtd">

<!-- Java and XML -->
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
 <JavaXML:Title>Java and XML</JavaXML:Title>
 <JavaXML:Contents>

  <JavaXML:Chapter focus="XML">
   <JavaXML:Heading>Introduction</JavaXML:Heading>
   <JavaXML:Topic subSections="7">What Is It?</JavaXML:Topic>
   <JavaXML:Topic subSections="3">How Do I Use It?</JavaXML:Topic>
   <JavaXML:Topic subSections="4">Why Should I Use It?</JavaXML:Topic>
   <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

  <JavaXML:Chapter focus="XML">
   <JavaXML:Heading>Creating XML</JavaXML:Heading>
   <JavaXML:Topic subSections="0">An XML Document</JavaXML:Topic>
   <JavaXML:Topic subSections="2">The Header</JavaXML:Topic>
   <JavaXML:Topic subSections="6">The Content</JavaXML:Topic>
   <JavaXML:Topic subSections="1">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

  <JavaXML:Chapter focus="Java">
   <JavaXML:Heading>Parsing XML</JavaXML:Heading>
   <JavaXML:Topic subSections="3">Getting Prepared</JavaXML:Topic>
   <JavaXML:Topic subSections="3">SAX Readers</JavaXML:Topic>
   <JavaXML:Topic subSections="9">Content Handlers</JavaXML:Topic>
   <JavaXML:Topic subSections="4">Error Handlers</JavaXML:Topic>
   <JavaXML:Topic subSections="0">
     A Better Way to Load a Parser
   </JavaXML:Topic>
   <JavaXML:Topic subSections="4">"Gotcha!"</JavaXML:Topic>
   <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

  <JavaXML:SectionBreak/>

  <JavaXML:Chapter focus="Java">
   <JavaXML:Heading>Web Publishing Frameworks</JavaXML:Heading>
   <JavaXML:Topic subSections="4">Selecting a Framework</JavaXML:Topic>
   <JavaXML:Topic subSections="4">Installation</JavaXML:Topic>
   <JavaXML:Topic subSections="3">
     Using a Publishing Framework
   </JavaXML:Topic>
   <JavaXML:Topic subSections="2">XSP</JavaXML:Topic>
   <JavaXML:Topic subSections="3">Cocoon 2.0 and Beyond</JavaXML:Topic>
   <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
  </JavaXML:Chapter>

 </JavaXML:Contents>

 <JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>

</JavaXML:Book>
XML Schema
XML Schema is a new working draft at the W3C that seeks to remedy many of the problems and limitations of DTDs. In addition to handling more accurate representations of XML structure constraints, XML Schema also seeks to provide an XML styling to the process of constraining data. Schemas are actually XML documents that are both well-formed and valid. This allows parsers and other XML-aware applications to handle XML Schema documents in a fashion similar to other XML documents, as opposed to employing special techniques as are needed for handling DTD documents.
Because XML Schema is a young and still incomplete specification, we will treat it only lightly here. In addition, details of the implementation of XML Schema are subject to change; if you have problems with some of the examples, you may want to consult the latest version of the XML Schema proposal at http://www.w3.org/TR/xmlschema-1/ and http://www.w3.org/TR/xmlschema-2/. You should also be aware that many XML parsers do not support XML Schema, or support only portions of the specification. You should check with your vendor to verify the level of XML Schema support provided by your XML parser.
There is also a difference between a valid document and a schema-valid document. Because XML Schema is not part of the XML 1.0 specification, a document that conforms to a given schema is not said to be valid. Only an XML document conforming to a referenced DTD through a DOCTYPE declaration is considered a valid XML document. This has caused quite a bit of confusion in the XML community as to how to handle schema validation. In addition to the difference in terms of validity, an XML 1.0 parser or application does not have to perform schema validation, again because XML Schema is not in the 1.0 specification of XML. This means that even if your document has a schema reference, the document may not be validated against that schema, regardless of the parser's level of schema support. For these reasons, you should take care to determine when your parser will and will not validate, and specifically how it handles schema validation. For clarity, we will continue to use validity as the single term, representing either schema or DTD validity. It will be up to you to see whether a
What's Next?
We have now looked at two ways to constrain our XML documents: the "old" way, by using DTDs, and the "new" way, using XML Schema. Hopefully, you are also beginning to see the importance of document constraints, particularly with regard to applications. If an application does not understand the type of information that an XML document should contain, manipulating and understanding the document's data becomes a much more difficult task. In the next chapter, we extend our knowledge of the SAX Java classes by looking at the facilities for accessing DTDs and schemas within our Java program. We will add to the parser example program we built in Chapter 3, allowing the program to read through document constraints and report errors if the XML documents read are not valid, and we will examine the callbacks available within the validation process.
Chapter 5: Validating XML
Your knowledge base and accompanying bag of XML tricks should be starting to feel a little more solid by now. You can create XML, use the Java SAX classes to parse through that XML, and now constrain that XML. This leads us to the next logical step: validating XML with Java. Without the ability to validate XML, business-to-business and inter-application communication becomes significantly more difficult; while constraints enable portability of our data, validity ensures its consistency. In other words, being able to constrain a document doesn't help much if we can't ensure that those constraints are enforced within our XML applications.
In this chapter, we will look at using additional SAX classes and interfaces to enforce validity constraints in our XML documents. We will examine how to set features and properties of a SAX-compliant parser, allowing easy configuration of validation, namespace handling, and other parser functionality. In addition, the errors and warnings that can occur with validating parsers will be detailed, filling in the blanks from earlier discussions on the SAX error handlers.
Configuring the Parser
With the wealth of XML-related specifications and technologies emerging from the World Wide Web Consortium (W3C), adding support for any new feature or property of an XML parser has become difficult. Many parser implementations have added proprietary extensions or methods at the cost of the portability of the code. While these software packages may implement the SAX XMLReader interface, the methods for setting document and schema validation, namespace support, and other core features are not standard across parser implementations. To address this, SAX 2.0 defines a standard mechanism for setting important properties and features of a parser that allows the addition of new properties and features as they are accepted by the W3C without the use of proprietary extensions or methods.
Luckily for us, SAX 2.0 includes the methods needed for setting properties and features in the XMLReader interface. This means we have to change little of our existing code to request validation, set the namespace separator, and handle other feature and property requests. The methods used for these purposes are outlined in Table 5.1.
Table 5.1: Property and Feature Methods
Method | Returns | Parameters | Syntax
setProperty(  ) | void | String propertyID, Object value | parser.setProperty("[Property URI]", "[Object parameter]");
setFeature(  )
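As a sketch of how feature requests behave in practice, the helper below asks a parser to set a feature and reports whether the request was honored. The FeatureSketch class and its method names are hypothetical, and obtaining the XMLReader through JAXP's SAXParserFactory is an assumption of this sketch rather than the book's approach. The two exception types are the ones SAX 2.0 defines for this purpose: SAXNotRecognizedException for an unknown URI, and SAXNotSupportedException for a known feature that cannot be set to the requested value.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, for illustration only.
class FeatureSketch {

    /** Obtains an XMLReader from the JDK's bundled parser (an assumption). */
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance()
                                   .newSAXParser()
                                   .getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }

    /** Tries to set a feature; returns false if the parser rejects it. */
    static boolean trySetFeature(XMLReader parser,
                                 String featureURI, boolean value) {
        try {
            parser.setFeature(featureURI, value);
            // Read the value back to confirm the parser accepted it
            return parser.getFeature(featureURI) == value;
        } catch (SAXNotRecognizedException e) {
            // The parser does not know this feature URI at all
            return false;
        } catch (SAXNotSupportedException e) {
            // The parser knows the feature but cannot set this value now
            return false;
        }
    }

    public static void main(String[] args) {
        XMLReader parser = newReader();
        System.out.println(trySetFeature(
            parser, "http://xml.org/sax/features/namespaces", true));
    }
}
```

Because features are addressed by URI strings rather than by vendor-specific methods, the same call works against any SAX 2.0 parser; an unsupported request is reported through an exception instead of failing to compile.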
Output of XML Validation
Make sure your XML document, DTD, copyright file (if you created one), and compiled classes are assembled. You may then run the example program, and you might be surprised at the output (shown in Example 5.5).
Example 5.5. SAXParserDemo Output
D:\prod\JavaXML> java SAXParserDemo D:\prod\JavaXML\contents\contents.xml
Parsing XML File: D:\prod\JavaXML\contents\contents.xml


    * setDocumentLocator(  ) called
Parsing begins...
**Parsing Error**
  Line:    13
  URI:     file:/D:/prod/JavaXML/contents/contents.xml
  Message: Document root element "JavaXML:Book", must match DOCTYPE root 
           "JavaXML:Book".
This rather cryptic error is a significant problem when using DTDs and namespaces together. The error seems to be stating that the root specified in the DOCTYPE declaration (JavaXML:Book) does not match the root element of the document itself. But the root element is JavaXML:Book, right? Actually, it's not! By default, SAX 2.0 specifies that parsers must enable their namespace feature, making all SAX 2.0 parsers namespace-aware unless this feature is explicitly set to false. We did not change this default, so our XMLReader implementation is namespace-aware. The unexpected result is that the parser sees our root element as Book, with the namespace prefix JavaXML. But remember that XML 1.0 and DTDs cannot distinguish between a prefix and an element name, so the root element the DTD expects to find is JavaXML:Book. When the parser finds Book, it reports the error above.
The only way to get around this rather annoying "feature" of SAX is to turn off namespace awareness on documents that are being validated by DTDs. Add the following code to your SAXParserDemo source file:
try {
    // Instantiate a parser
    XMLReader parser = 
        XMLReaderFactory.createXMLReader(
            "org.apache.xerces.parsers.SAXParser");
        
    // Register the content handler
    parser.setContentHandler(contentHandler);
    
    // Register the error handler
    parser.setErrorHandler(errorHandler);    
    
    // Turn on validation
    parser.setFeature("http://xml.org/sax/features/validation",
                      true);       
                
The DTDHandler Interface
The last core handler that SAX provides defines callback methods that are invoked while an XML document's DTD is being read and parsed. This interface does not define events that take place during the process of validation, but only those that occur during the process of reading the DTD. In fact, in our section on "gotchas" we will look at some of the confusion this distinction often causes. This handler behaves in the same manner as the ContentHandler and ErrorHandler interfaces that we looked at in Chapter 3, defining two callback methods that are invoked during the parsing process.
As important as XML document validation is, the events involved with reading the DTD document are not very significant. With only two callback methods, and both of those not commonly used, you will probably not find many uses for the DTDHandler interface unless you are writing an XML editor or IDE and need to build or process DTD documents for correct syntax and notation. We will look at the two callback methods provided by SAX here, but will not spend much time on their use, as they are not significant in our use of XML for non-editor type applications. For information on an optional SAX handler that can help in reading further DTD information, refer to the DeclHandler interface in Appendix A, under the org.xml.sax.ext package.
The first callback method, unparsedEntityDecl( ), is invoked when a DTD has an entity declaration signifying that the XML parser should not parse a particular entity. Though we have not looked at an example of this, unparsed entities are common in XML documents that reference images or other binary data, such as media files. This method takes in the name of the entity, the public and system IDs, and the notation name of the entity. Notation names are another XML term we have not yet looked at. Consider the example of an XML document fragment that refers to an image, possibly representing a logo, shown in Example 5.9.
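Separately from the book's own listing, the following sketch shows the two DTDHandler callbacks in action. Everything specific to it is an assumption made for illustration: the DtdEventRecorder class, the tiny sample document with its gif notation and logo entity, and the use of JAXP to obtain the parser. The handler simply records what the parser reports while reading the DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical class; a sketch, not the book's Example 5.9.
class DtdEventRecorder implements DTDHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        events.add("notation " + name);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        events.add("unparsed entity " + name + " (" + notationName + ")");
    }

    // Assumed sample: an image entity the parser must not try to parse
    static final String SAMPLE =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE doc [\n"
        + "  <!ELEMENT doc EMPTY>\n"
        + "  <!NOTATION gif SYSTEM \"image/gif\">\n"
        + "  <!ENTITY logo SYSTEM \"logo.gif\" NDATA gif>\n"
        + "]>\n"
        + "<doc/>";

    /** Parses the XML text and returns the DTD events that were reported. */
    static List<String> record(String xml) {
        try {
            XMLReader parser = SAXParserFactory.newInstance()
                                               .newSAXParser()
                                               .getXMLReader();
            DtdEventRecorder recorder = new DtdEventRecorder();
            parser.setDTDHandler(recorder);
            parser.parse(new InputSource(new StringReader(xml)));
            return recorder.events;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        record(SAMPLE).forEach(System.out::println);
    }
}
```

The NDATA keyword in the entity declaration is what marks logo as unparsed; the notation name that follows it (gif) is the value handed to the callback as notationName.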
"Gotcha!"
To continue with the theme of trying to provide some cautionary advice on your path to XML mastery, some additional pitfalls associated with XML validation are included here. These are often problems run into by newer XML developers, as the solutions are not immediately apparent. Take heed of them, as they have caused many a developer long hours of tedious debugging, or simple confusion at unexpected application output.
One of the most common misunderstandings about using SAX for validation is thinking that validating an XML document is contingent upon registering a SAX DTDHandler implementation with the XML parser. Often, time and effort are spent implementing the DTDHandler interface and registering it with the parser, while the validation feature of the parser is never set. This error arises from a mistaken association between handling a DTD and actually using the DTD for validation. In this case, the DTD would be parsed, and all DTD callback events would occur (if any were needed). However, the XML document itself would not be validated, but simply parsed. Keep in mind that the output from parsing a valid XML document looks almost identical to output from a non-validated XML document; always be aware of when validation is occurring to avoid application bugs:
try {
    // Instantiate a parser
    XMLReader parser = 
        XMLReaderFactory.createXMLReader(
            "org.apache.xerces.parsers.SAXParser");
        
    // Register the content handler
    parser.setContentHandler(contentHandler);
    
    // Register the error handler
    parser.setErrorHandler(errorHandler);

    // This has no effect on turning on validation!
    parser.setDTDHandler(dtdHandler);

    // Turn on validation
    parser.setFeature("http://xml.org/sax/features/validation", true);
                
    // Turn off namespace awareness
    parser.setFeature("http://xml.org/sax/features/namespaces", false);              
        
    // Parse the document
    parser.parse(uri);
    
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage(  ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage(  ));
}
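The distinction can be checked with a small experiment, sketched below under a few assumptions: the ValidationSketch class, its sample document, and the JAXP call used to obtain the parser are all invented for illustration. The document is well-formed but not valid, since it uses an element the DTD never declares. A DTDHandler is registered in both runs; only the validation feature changes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, for illustration only.
class ValidationSketch {
    // Well-formed but invalid: <b> is not declared, and <a> must be empty
    static final String INVALID =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE a [ <!ELEMENT a EMPTY> ]>\n"
        + "<a><b/></a>";

    /** Counts recoverable errors, which is where validity errors arrive. */
    static class ErrorCounter extends DefaultHandler {
        int errors;

        @Override
        public void error(SAXParseException e) {
            errors++;  // count the error and let parsing continue
        }
    }

    /** Parses the XML text and returns the number of errors reported. */
    static int countErrors(String xml, boolean validate) {
        try {
            XMLReader parser = SAXParserFactory.newInstance()
                                               .newSAXParser()
                                               .getXMLReader();
            ErrorCounter counter = new ErrorCounter();
            parser.setErrorHandler(counter);

            // Registered in both runs; by itself it does not validate
            parser.setDTDHandler(counter);

            parser.setFeature(
                "http://xml.org/sax/features/validation", validate);
            parser.parse(new InputSource(new StringReader(xml)));
            return counter.errors;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("validation off: " + countErrors(INVALID, false));
        System.out.println("validation on:  " + countErrors(INVALID, true));
    }
}
```

With the feature off, the parse completes without a single error callback even though a DTDHandler is in place; with the feature on, the same document produces validity errors.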
What's Next?
By now you should feel very comfortable with XML documents and how to constrain those documents. We have also looked at all of the major aspects of using the SAX interfaces and classes, and you should have a solid understanding of the parsing and validating lifecycle, as well as what document callbacks are available.