http://www.w3.org/TR/REC-xml. Example 1-1 shows an XML document that conforms to this specification. I’ll use it to illustrate several important concepts.
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/"
xmlns:l="http://purl.org/rss/1.0/modules/link/"
xmlns:content="http://purl.org/rss/1.0/modules/content/">
<!--Generated by Blogger v5.0-->
<channel rdf:about="http://www.neilgaiman.com/journal/journal.asp">
<title>Neil Gaiman's Journal</title>
<link>http://www.neilgaiman.com/journal/journal.asp</link>
<description>Neil Gaiman's Journal</description>
<dc:date>2005-04-30T01:57:38Z</dc:date>
<dc:language>en-US</dc:language>
<admin:generatorAgent rdf:resource="http://www.blogger.com/" />
<admin:errorReportsTo rdf:resource="mailto:rss-errors@blogger.com" />
<items>
<rdf:Seq>
<rdf:li
rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp" />
<rdf:li
rdf:resource="http://www.neilgaiman.com/journal/2005/04/jetlag-morning.asp" />
<rdf:li
rdf:resource="http://www.neilgaiman.com/journal/2005/04/demon-days.asp" />
<rdf:li
rdf:resource="http://www.neilgaiman.com/journal/2005/04/more-from-mailbag.asp" />
<rdf:li
rdf:resource="http://www.neilgaiman.com/journal/2005/04/two-days.asp" />
<rdf:li
rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp" />
</rdf:Seq>
</items>
</channel>
<!-- and so on... -->
</rdf:RDF>
http://www.neilgaiman.com). It uses a lot of RSS syntax, which I’ll cover in Chapter 12 in detail.
http://www.w3.org). If you don’t recall hearing much about XML 1.1, it’s no surprise; XML 1.1 was largely about Unicode conformance, and really didn’t affect XML as a whole that much, particularly for document authors and programmers not working with unusual character sets.
version="1.0" and haven’t needed to change that yet. If you want to understand more about the intricacies of Unicode and XML 1.1, check out the complete specification at http://www.w3.org/TR/xml11.
http://www.w3.org/XML and see what looks interesting.
ELEMENT definitions (covered in this section) and attribute (ATTLIST) definitions (covered in the next section). An element definition begins with the standard <! opening of a DTD tag, followed by the ELEMENT keyword and then the name of the element. Following that name is the content model of the element. The content model is generally within parentheses and specifies what content can be included within the element. Take the item element, from the RSS 0.91 DTD (http://my.netscape.com/publish/formats/rss-0.91.dtd) as an example:
<!ELEMENT item (title | link | description)*>
item element, there may be a title element, a link element, or a description element nested within that item. The “or” relationship is indicated by the pipe (|) symbol; the OR applies to all elements within a group, indicated by the parentheses. In other words, each occurrence of the grouping (title | link | description) matches exactly one of title, link, or description. The asterisk after the grouping indicates a recurrence: the group may repeat zero or more times, so an item can hold any number of these three elements, in any order. Table 2-1 lists the complete set of DTD recurrence modifiers.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:dw="http://www.ibm.com/developerWorks/"
elementFormDefault="unqualified"
attributeFormDefault="unqualified" version="4.0">
xsd prefix, allowing separation of XML Schema constructs from the elements and attributes being constrained. Next, the dw namespace is defined; this particular example is from the IBM DeveloperWorks XML article template, and dw is used for DeveloperWorks-specific constructs.
attributeFormDefault and elementFormDefault are set to "unqualified". This means that locally declared elements and attributes appear in XML instance documents without a namespace qualification. Qualifications are a fairly tricky idea, largely because attributes in XML do not fall into the default namespace; they must explicitly be assigned to a namespace. For a lot more on qualification, check out the relevant portion of the XML Schema specification at http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-schema.
version attribute is given a value of "4.0". This is used to indicate the version of this particular schema, not of the XML Schema specification being used. The namespace assigned to the
http://www.oasis-open.org/home/index.php), RELAX NG is still seen as almost a grassroots effort to compete with (or at least provide an alternative to) XML Schema. Whatever you think about the political standing of RELAX NG, though, any good XML programmer should have RELAX NG in her constraint toolkit.
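To make the elementFormDefault="unqualified" setting from the XML Schema discussion a bit more concrete, here is a minimal sketch that validates two tiny instance documents with the JDK's javax.xml.validation API. The schema, the urn:example:dw namespace, and the article and title element names are all invented for illustration; none of them come from the DeveloperWorks template.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class FormDefaultDemo {
    // Hypothetical schema: the namespace URI and element names are illustrative only.
    static final String SCHEMA =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:dw'"
        + " elementFormDefault='unqualified' attributeFormDefault='unqualified'>"
        + "<xsd:element name='article'><xsd:complexType><xsd:sequence>"
        + "<xsd:element name='title' type='xsd:string'/>"
        + "</xsd:sequence></xsd:complexType></xsd:element>"
        + "</xsd:schema>";

    /** Returns true if the instance document validates against SCHEMA. */
    static boolean isValid(String instance) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
            schema.newValidator().validate(new StreamSource(new StringReader(instance)));
            return true;
        } catch (SAXException e) {
            return false; // with no ErrorHandler set, a validity error is thrown
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The local <title> element carries no namespace.
        String unqualifiedChild =
            "<dw:article xmlns:dw='urn:example:dw'><title>Hi</title></dw:article>";
        // The default namespace declaration pulls <title> into urn:example:dw.
        String qualifiedChild =
            "<article xmlns='urn:example:dw'><title>Hi</title></article>";
        System.out.println("unqualified child valid? " + isValid(unqualifiedChild));
        System.out.println("qualified child valid?   " + isValid(qualifiedChild));
    }
}
```

Only the globally declared article element belongs to the target namespace here; the locally declared title must stay unqualified, which is why the second document is rejected.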
grammar element:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<!-- Content model for XML -->
</grammar>
datatypeLibrary lets the schema know where to pull data types (covered in the “Data types” section later) from, when you type elements and attributes. You don’t have to put this on your root element, but you’ll find that’s the best place to locate the reference; otherwise, you end up burying it somewhere in the middle of your schema, and that’s a maintenance pain.
http://relaxng.org/ns/structure/1.0).
element keyword, and nestings within an XML document are represented by nestings within the RELAX NG schema:
<element name="phonebook">
<element name="entry">
<element name="firstName">
<text/>
</element>
<element name="lastName">
<text/>
</element>
<!-- etc... -->
</element>
</element>
http://www.saxproject.org). SAX is what makes insertion of this application-specific code into events possible. The interfaces provided in the SAX package are an important part of any programmer’s toolkit for handling XML. Even though the SAX classes are small and few in number, they provide a critical framework for Java and XML to operate within. Solid understanding of how they help in accessing XML data is critical to effectively leveraging XML in your Java programs.
startElement event. It provides information about that event, such as the name of the element, its attributes, and so on, and then your code gets to respond. You, as a programmer, have to write code for each event that is important to you, from the start of a document to a comment to the end of an element. This process is summed up in Figure 3-1.
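The event-driven flow described above can be sketched in a few lines. This is only an illustration, not the SAXTreeViewer code: the SaxEventTrace class name is made up, the parser is obtained through the JDK's JAXP SAXParserFactory, and the handler extends the DefaultHandler convenience class so that only the interesting callbacks need to be written.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventTrace {
    /** Parses the XML text and returns the callbacks in the order they fired. */
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void endDocument() { events.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                events.add("startElement " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                events.add("endElement " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                events.add("characters");
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : trace("<entry><firstName>Neil</firstName></entry>")) {
            System.out.println(event);
        }
    }
}
```

The parser drives everything: your code never asks for the next element, it just reacts as each callback arrives.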
ContentHandler
SAXTreeViewer class. This utility uses SAX to parse an XML document, and displays the document visually as a Swing JTree.
org.xml.sax.XMLReader interface; remember, this is why you downloaded a SAX-compliant parser in the first place.
the org.xml.sax.XMLReader interface for all SAX-compliant XML parsers to implement. For example, the Xerces SAX parser implementation, org.apache.xerces.parsers.SAXParser, implements the XMLReader interface. If you have access to the source of your parser, you should see the same interface implemented in your parser’s main SAX parser class. Each XML parser must have one class (and sometimes has more than one) that implements this interface, and that is the class you need to instantiate to allow for parsing XML:
// Instantiate a Reader
XMLReader reader = new org.apache.xerces.parsers.SAXParser( );
// Do something with the parser
reader.parse(uri);
XMLReader isn’t called Parser. In fact, it was in SAX 1.0, and then so many changes were introduced that the interface had to be deprecated and renamed. As a result, you’ll call the parse( ) method on the XMLReader interface.
org.xml.sax.helpers.XMLReaderFactory to get away from this:
XMLReader reader = XMLReaderFactory.createXMLReader( );
org.xml.sax.driver system property, and you can get your vendor’s XMLReader implementation, without importing your vendor’s classes:
java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser [MyClassName]
org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, org.xml.sax.DTDHandler, and org.xml.sax.EntityResolver.
ContentHandler and ErrorHandler. I’ll leave discussion of DTDHandler and EntityResolver for the next chapter; it is enough for now to understand that EntityResolver and DTDHandler work just like the other handlers, but just group different behaviors.
setContentHandler( ), setErrorHandler( ), setDTDHandler( ), and setEntityResolver( ), all on the XMLReader interface (see Figure 3-4). Then the reader invokes the callback methods on the appropriate handlers during parsing.
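Here is a rough sketch of that registration step, with a ContentHandler and an ErrorHandler set on the same reader. Note the assumptions: the reader comes from the JDK's JAXP SAXParserFactory rather than from Xerces or XMLReaderFactory, so that no vendor class needs to be on the classpath, and the HandlerRegistration and Report names are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerRegistration {
    /** What the two handlers observed during one parse. */
    static final class Report {
        int elements;
        final List<String> problems = new ArrayList<>();
    }

    static Report parse(String xml) {
        final Report report = new Report();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Content callbacks: count every element that starts.
            reader.setContentHandler(new DefaultHandler() {
                @Override public void startElement(String uri, String localName,
                                                   String qName, Attributes atts) {
                    report.elements++;
                }
            });
            // Error callbacks: record the severity and the line number.
            reader.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) {
                    report.problems.add("warning, line " + e.getLineNumber());
                }
                @Override public void error(SAXParseException e) {
                    report.problems.add("error, line " + e.getLineNumber());
                }
                @Override public void fatalError(SAXParseException e) throws SAXException {
                    report.problems.add("fatal, line " + e.getLineNumber());
                    throw e; // the document is not well-formed, so stop parsing
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Parse problems were already recorded by the ErrorHandler.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return report;
    }

    public static void main(String[] args) {
        Report bad = parse("<a><b></a>");
        System.out.println(bad.elements + " elements, problems: " + bad.problems);
    }
}
```

Keeping the two handlers as separate objects mirrors the separate setter methods; nothing stops you from passing one object to both if it implements both interfaces.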
SAXTreeViewer example, start by implementing the ContentHandler interface. ContentHandler, as the name implies, details events related to the content of an XML document: elements, attributes, character data, etc. Add the following class to the end of your SAXTreeViewer.java source listing:
class JTreeHandler implements ContentHandler {
/** Tree Model to add nodes to */
private DefaultTreeModel treeModel;
/** Current node to add sub-nodes to */
private DefaultMutableTreeNode current;
public JTreeHandler(DefaultTreeModel treeModel,
DefaultMutableTreeNode base) {
this.treeModel = treeModel;
this.current = base;
}
// ContentHandler callback implementations
}
ContentHandler interface for handling parsing events, SAX provides an ErrorHandler interface that can be implemented to treat various error conditions that may arise during parsing (see Figure 3-10).
SAXParseException. This object holds the line number where the trouble was encountered, the URI of the document being treated (which could be the parsed document or an external reference within that document), and normal exception details such as a message and a printable stack trace. In addition, each of these methods can throw a SAXException. This may seem a bit odd at first: an exception handler that throws an exception? Keep in mind that each handler receives a parsing exception. This might be a warning that should not cause the parsing process to stop or an error that needs to be resolved for parsing to continue; however, the callback may need to perform system I/O or another operation that can throw another exception, and the method needs to be able to send any problems resulting from these actions up the application chain.
setValidating( )) or a set of classes (javax.xml.validation) to handle validation; you might expect to call a similar method (setValidation( ) or something similar) to initiate validation in SAX. But then, there’s also namespace awareness, dealt with quite a bit in Chapter 2 (and Chapter 3, with respect to Q names and local names); maybe setNamespaceAwareness( )? But what about schema validation? And setting the location of a schema to validate on, if the document doesn’t specify one? There’s also low-level behavior, like telling the parser what to do with entities (parse them? don’t parse them?), how to handle strings, and a lot more. As you can imagine, dealing with each of these could cause real API bloat, adding 20 or 30 methods to SAX’s XMLReader interface. And, even worse, each time a new setting was needed (perhaps for the next type of constraint model supported? How about setRelaxNGSchema( )?), the SAX API would have to add a method or two, and re-release a new version. Clearly, this isn’t a very effective approach to API design.
Object argument for the parser to use; for instance, certain types of handlers are set by specifying a URI and supplying the Object that implements that handler’s interface. A
feature is a setting that is either on (true) or off (false). Several obvious examples come to mind, such as namespace awareness and validation.
ContentHandler), and how to deal with error conditions (ErrorHandler). Both of these are concerned specifically with the data in an XML document. What I haven’t talked about is the process by which the parser goes outside of the document and gets data. For example, consider a simple entity reference in an XML document:
<FM>
<P>Text placed in the public domain by Moby Lexical Tools, 1992.</P>
<P>SGML markup by Jon Bosak, 1992-1994.</P>
<P>XML version by Jon Bosak, 1996-1998.</P>
<P>&usage-terms;</P>
</FM>
<!ENTITY usage-terms
SYSTEM "http://www.newInstance.com/entities/usage-terms.xml">
usage-terms entity reference will be expanded (in this case, to “This work may be freely copied and distributed worldwide.”, as seen in Figure 4-1).
org.xml.sax.EntityResolver. This interface does exactly what it says: resolves entities. More important, it allows you to get involved in the entity resolution process. The interface defines only a single method, as shown in Figure 4-2.
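As a hedged sketch of what getting involved in entity resolution can look like, the following resolver answers the usage-terms lookup with local text instead of letting the parser fetch the URL. The LocalEntityResolverDemo class name and the inline test document are made up for illustration, and the reader is obtained through JAXP; the single resolveEntity( ) method is implemented as a lambda.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class LocalEntityResolverDemo {
    // Small stand-in document that declares and uses the usage-terms entity.
    static final String DOC =
          "<!DOCTYPE FM [\n"
        + "  <!ENTITY usage-terms\n"
        + "    SYSTEM \"http://www.newInstance.com/entities/usage-terms.xml\">\n"
        + "]>\n"
        + "<FM><P>&usage-terms;</P></FM>";

    /** Parses DOC, resolving usage-terms locally, and returns the character data seen. */
    static String expandedText() {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver((publicId, systemId) -> {
                if (systemId != null && systemId.endsWith("usage-terms.xml")) {
                    // Supply the content ourselves; no network access is needed.
                    return new InputSource(new StringReader(
                        "This work may be freely copied and distributed worldwide."));
                }
                return null; // null asks the parser to resolve the entity itself
            });
            reader.setContentHandler(new DefaultHandler() {
                @Override public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(DOC)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandedText());
    }
}
```

Returning null for everything else keeps the parser's default behavior for entities you don't care about.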
XMLReader instance through setEntityResolver( ). Once that’s done, every time the reader comes across an entity reference, it passes the public ID and system ID for that entity to the EntityResolver,
I’m going to cruise through DTDHandler (also in org.xml.sax). In almost nine years of extensive SAX and XML programming, I’ve used this interface only once (in writing JDOM, covered in Chapter 9), and even then, it was a rather obscure case. Still, if you work with unparsed entities often, are into parser internals, or just want to get into every nook and cranny of the SAX API, then you need to know about DTDHandler. The interface is shown in all its simplicity in Figure 4-4.
DTDHandler interface allows you to receive notification when a reader encounters an unparsed entity or notation declaration. Of course, both of these events occur in DTDs, not XML documents, which is why this is called DTDHandler. The two methods listed in Figure 4-4 do exactly what you would expect. The first reports a notation declaration, including its name, public ID, and system ID. Remember the NOTATION structure in DTDs? (Flip back to Chapter 2 if you’re unclear.)
<!NOTATION jpeg SYSTEM "images/jpeg">
<!ENTITY stars_logo SYSTEM "http://www.nhl.com/img/team/dal38.gif"
NDATA jpeg>
DTDHandler and register it with your reader through the XMLReader’s setDTDHandler( ) method. This is generally useful when writing low-level applications that must reproduce XML content (such as an XML editor), or when you want to build up some Java representation of a DTD’s constraints (such as in a data binding implementation). In most other situations, it isn’t something you will need very often.
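For the curious, here is a small sketch of a DTDHandler picking up the notation and unparsed entity declarations shown above. The DtdHandlerDemo class name, the team root element, and the idea of placing the declarations in an internal DTD subset are assumptions made to keep the example self-contained; the reader again comes from JAXP.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class DtdHandlerDemo {
    // Hypothetical document carrying the two declarations in its internal subset.
    static final String DOC =
          "<!DOCTYPE team [\n"
        + "  <!NOTATION jpeg SYSTEM \"images/jpeg\">\n"
        + "  <!ENTITY stars_logo SYSTEM \"http://www.nhl.com/img/team/dal38.gif\"\n"
        + "    NDATA jpeg>\n"
        + "]>\n"
        + "<team/>";

    /** Parses DOC and returns a description of each DTDHandler callback received. */
    static List<String> declarations() {
        final List<String> seen = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setDTDHandler(new DTDHandler() {
                @Override public void notationDecl(String name, String publicId,
                                                   String systemId) {
                    seen.add("notation " + name);
                }
                @Override public void unparsedEntityDecl(String name, String publicId,
                                                         String systemId,
                                                         String notationName) {
                    seen.add("unparsed entity " + name + " (notation " + notationName + ")");
                }
            });
            reader.parse(new InputSource(new StringReader(DOC)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        for (String declaration : declarations()) {
            System.out.println(declaration);
        }
    }
}
```

The parser never tries to load the unparsed entity itself; it only reports the declaration and leaves the rest to the application.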
SAX is interface-driven, you have to do a lot of tedious work to get started with an XML-based application. For example, when you write your ContentHandler implementation, you have to implement each and every method of that interface, even if you aren’t inserting behavior into each callback. If you need an ErrorHandler, you add three more method implementations; using DTDHandler? That’s a few more. A lot of times, though, you’re writing lots of no-operation methods, as you only need to interact with a couple of key callbacks.
org.xml.sax.helpers.DefaultHandler can be a real boon in these situations. This class doesn’t define any behavior of its own; however, it does implement ContentHandler, ErrorHandler, EntityResolver, and DTDHandler, and provides empty implementations of each method of each interface. So you can have a single class (call it, for example, MyHandlerClass) that extends DefaultHandler. You then only override the callback methods you’re concerned with. You might implement startElement( ), characters( ), endElement( ), and fatalError( ), for example. In any combination of implemented methods, though, you’ll save tons of lines of code for methods you don’t need to provide action for, and make your code a lot clearer too. Then, the argument to setErrorHandler( ), setContentHandler( ), and setDTDHandler( ) would be the same instance of this MyHandlerClass.
DefaultHandler instance to setEntityResolver( ) as well, although (as I’ve already said) I discourage mixing EntityResolver implementations in with these other handlers.
org.xml.sax.ext. In some cases, you’ll have to download these directly from the SAX web site (http://www.saxproject.org), although most parsers will include these in the parser download.
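A minimal sketch of the DefaultHandler approach might look like the following. MyHandlerClass overrides only the four callbacks mentioned in the text, and one instance is registered as content, error, and DTD handler. The fields, the run( ) helper, and the use of JAXP to obtain the reader are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class MyHandlerClass extends DefaultHandler {
    int opened;                                      // startElement( ) calls seen
    int closed;                                      // endElement( ) calls seen
    final StringBuilder text = new StringBuilder();  // all character data
    String fatalMessage;                             // set only if parsing failed

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        opened++;
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override public void endElement(String uri, String localName, String qName) {
        closed++;
    }

    @Override public void fatalError(SAXParseException e) throws SAXException {
        fatalMessage = e.getMessage();
        throw e;
    }

    /** Parses the XML text with a single handler instance in every role. */
    static MyHandlerClass run(String xml) {
        MyHandlerClass handler = new MyHandlerClass();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.setDTDHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // fatalError( ) has already captured the message.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        MyHandlerClass result = run("<a><b>x</b><c/></a>");
        System.out.println(result.opened + " elements, text: " + result.text);
    }
}
```

Every callback not listed here falls through to the inherited DefaultHandler implementation, which is the whole point of the class.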
org.xml.sax.ext.LexicalHandler. This handler provides methods that can receive notification of several lexical events in an XML document, such as comments, entity declarations, DTD declarations, and CDATA sections. In ContentHandler, these lexical events are essentially ignored, and you just get the data and declarations without notification of when or how they were provided.
CDATA section or not. However, if you are working with an XML editor, serializer, or other component that must know the exact format of the input document (and not just its contents), then the LexicalHandler can really help you out.
import statement for org.xml.sax.ext.LexicalHandler to your SAXTreeViewer.java source file. Once that’s done, you can add LexicalHandler to the implements clause in the nonpublic class JTreeHandler in that source file:
class JTreeHandler implements ContentHandler, ErrorHandler, LexicalHandler {
startDTD( ) and endDTD( ) callbacks (I’ve coded up versions appropriate for SAXTreeViewer here):
public void startDTD(String name, String publicID,
String systemID)
throws SAXException {
DefaultMutableTreeNode dtdReference =
new DefaultMutableTreeNode("DTD for '" + name + "'");
if (publicID != null) {
DefaultMutableTreeNode publicIDNode =
new DefaultMutableTreeNode("Public ID: '" + publicID + "'");
dtdReference.add(publicIDNode);
}
if (systemID != null) {
DefaultMutableTreeNode systemIDNode =
new DefaultMutableTreeNode("System ID: '" + systemID + "'");
dtdReference.add(systemIDNode);
}
current.add(dtdReference);
}
public void endDTD( ) throws SAXException {
// No action needed here
}
http://www.saxproject.org), you can add some fairly advanced behavior to your SAX applications. This will also get you in the mindset of using SAX as a pipeline of events, rather than a single layer of processing.
org.xml.sax.XMLFilter interface that comes in the basic SAX download, and should be included with any parser distribution supporting SAX 2. This interface extends the XMLReader interface, and adds two new methods to it, as shown in Figure 4-8.
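To give a feel for filtering, here is a small sketch of a filter sitting between a parser and a handler. It is built on org.xml.sax.helpers.XMLFilterImpl, a helper class from the standard SAX distribution that implements XMLFilter and passes every event through unchanged unless you override a callback. The UppercaseFilter name and its renaming behavior are invented for this example, and the underlying reader is obtained through JAXP.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that upper-cases element names before passing them on. */
final class UppercaseFilter extends XMLFilterImpl {
    UppercaseFilter(XMLReader parent) {
        super(parent); // equivalent to calling setParent(parent)
    }

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) throws SAXException {
        super.startElement(uri, localName.toUpperCase(Locale.ROOT),
                           qName.toUpperCase(Locale.ROOT), atts);
    }

    @Override public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName.toUpperCase(Locale.ROOT),
                         qName.toUpperCase(Locale.ROOT));
    }

    /** Runs parser -> filter -> handler and returns the element names the handler saw. */
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            XMLReader pipeline = new UppercaseFilter(parser);
            pipeline.setContentHandler(new DefaultHandler() {
                @Override public void startElement(String uri, String localName,
                                                   String qName, Attributes atts) {
                    names.add(qName);
                }
            });
            // Parsing through the filter makes it register itself with its parent.
            pipeline.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<a><b/></a>"));
    }
}
```

Because a filter is itself an XMLReader, the handler at the end of the chain cannot tell whether it is talking to a parser or to a stack of filters.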
XMLReaders through this filtering mechanism, you can build up a processing chain, or
pipeline, of events. To understand what I mean by a pipeline, you first need to understand the normal flow of a SAX parse:
http://www.w3.org). Whereas SAX is public domain software, developed through long discussions on the XML-dev mailing list, DOM is a standard, just like the actual XML specification. The DOM is designed to represent the content and model of XML documents across all programming languages and tools. On top of that specification, there are several language bindings. These bindings exist for JavaScript, Java, CORBA, and other languages, allowing the DOM to be a cross-platform and cross-language specification.
http://www.w3.org/TR/REC-DOM-Level-1. Level 1 details the functionality and navigation of content within a document.
http://www.w3.org/TR/DOM-Level-2-Core. This is actually the recommendation for the DOM Core; all the supplemental modules are represented by their own specifications:
http://www.w3.org/TR/DOM-Level-2-Views)
org.w3c.dom.Document interface.
package javaxml3;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
// Parser import
import org.apache.xerces.parsers.DOMParser;
public class SerializeTester {
// File to read XML from
private File inputXML;
// File to serialize XML to
private File outputXML;
public SerializeTester