<TABLE> but not the tag
<CHAIR>. While the first tag has a specific
meaning to an application using the data, and is used to signify the
start of a table in HTML, the second tag has no specific meaning, and
although most browsers will ignore it, unexpected things can happen
when it appears. That is because when HTML was defined, the tag set
of the language was defined with it. With each new version of HTML,
new tags are defined. However, if a tag is not defined, it may not be
used as part of the markup language without generating an error when
the document is parsed. The grammar of a markup
language defines the correct use of the language's tags. Again,
let's use HTML as an example. When using the
<TABLE> tag, several attributes may be
included, such as the width, the background color, and the alignment.
However, you cannot define the TYPE of the table
because the grammar of HTML does not allow it.

<TABLE> and then nest within that tag
several <CHAIR> tags, you may do so. If you
wish to define a TYPE attribute for the
<CHAIR> tag, you may do that also. You could
even use tags named after your children or co-workers if you so
desired! To demonstrate, let's take a look at the XML file
shown in Example 1.1.

<?xml version="1.0"?>
<dining-room>
  <table type="round" wood="maple">
    <manufacturer>The Wood Shop</manufacturer>
    <price>$1999.99</price>
  </table>
  <chair wood="maple">
    <quantity>2</quantity>
    <quality>excellent</quality>
    <cushion included="true">
      <color>blue</color>
    </cushion>
  </chair>
  <chair wood="oak">
    <quantity>3</quantity>
    <quality>average</quality>
  </chair>
</dining-room>
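To see how an application might give meaning to these made-up tags, here is a minimal sketch that reads a shortened copy of the dining-room document and collects the wood attribute of each chair element. It assumes the SAX parser bundled with the JDK (obtained through javax.xml.parsers.SAXParserFactory) rather than any particular vendor's parser. The class name DiningRoomDemo and the method chairWoods are hypothetical names chosen only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's example code.
public class DiningRoomDemo {

    // A shortened copy of the dining-room document.
    static final String XML =
        "<?xml version=\"1.0\"?>"
        + "<dining-room>"
        + "<table type=\"round\" wood=\"maple\">"
        + "<manufacturer>The Wood Shop</manufacturer>"
        + "</table>"
        + "<chair wood=\"maple\"><quantity>2</quantity></chair>"
        + "<chair wood=\"oak\"><quantity>3</quantity></chair>"
        + "</dining-room>";

    /** Returns the wood attribute of every chair element, in document order. */
    static List<String> chairWoods(String xml) throws Exception {
        final List<String> woods = new ArrayList<>();
        // Assumption: use the parser that ships with the JDK.
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The tag name only means something because this code says so.
                if ("chair".equals(qName)) {
                    woods.add(atts.getValue("wood"));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return woods;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(chairWoods(XML));
    }
}
```

The parser itself attaches no meaning to chair or wood; the application code decides what to do with them.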
org.xml.sax.ContentHandler interface
that defines methods such as startDocument( ) and
endElement( ). Implementing this interface allows
complete control over these portions of the XML parsing process.
There is a similar interface for handling errors and lexical
constructs. A set of errors and warnings is defined, allowing
handling of the various situations that can occur in XML parsing,
such as an invalid document, or one that is not well-formed. Behavior
can be added to customize the parsing process, resulting in very
application-specific tasks being available for definition, all with a
standard interface into XML documents. For the SAX API documentation
and other information on SAX, visit
http://www.megginson.com/SAX.

<?xml version="1.0"?>
<?xml-stylesheet href="XSL\JavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSL\JavaXML.wml.xsl" type="text/xsl"
                 media="wap"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE JavaXML:Book SYSTEM "DTD\JavaXML.dtd">
<!-- Java and XML -->
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
  <JavaXML:Title>Java and XML</JavaXML:Title>
  <JavaXML:Contents>
    <JavaXML:Chapter focus="XML">
      <JavaXML:Heading>Introduction</JavaXML:Heading>
      <JavaXML:Topic subSections="7">What Is It?</JavaXML:Topic>
      <JavaXML:Topic subSections="3">How Do I Use It?</JavaXML:Topic>
      <JavaXML:Topic subSections="4">Why Should I Use It?</JavaXML:Topic>
      <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
    <JavaXML:Chapter focus="XML">
      <JavaXML:Heading>Creating XML</JavaXML:Heading>
      <JavaXML:Topic subSections="0">An XML Document</JavaXML:Topic>
      <JavaXML:Topic subSections="2">The Header</JavaXML:Topic>
      <JavaXML:Topic subSections="6">The Content</JavaXML:Topic>
      <JavaXML:Topic subSections="1">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
    <JavaXML:Chapter focus="Java">
      <JavaXML:Heading>Parsing XML</JavaXML:Heading>
      <JavaXML:Topic subSections="3">Getting Prepared</JavaXML:Topic>
      <JavaXML:Topic subSections="3">SAX Readers</JavaXML:Topic>
      <JavaXML:Topic subSections="9">Content Handlers</JavaXML:Topic>
      <JavaXML:Topic subSections="4">Error Handlers</JavaXML:Topic>
      <JavaXML:Topic subSections="0">
        A Better Way to Load a Parser
      </JavaXML:Topic>
      <JavaXML:Topic subSections="4">"Gotcha!"</JavaXML:Topic>
      <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
    <JavaXML:SectionBreak/>
    <JavaXML:Chapter focus="Java">
      <JavaXML:Heading>Web Publishing Frameworks</JavaXML:Heading>
      <JavaXML:Topic subSections="4">Selecting a Framework</JavaXML:Topic>
      <JavaXML:Topic subSections="4">Installation</JavaXML:Topic>
      <JavaXML:Topic subSections="3">
        Using a Publishing Framework
      </JavaXML:Topic>
      <JavaXML:Topic subSections="2">XSP</JavaXML:Topic>
      <JavaXML:Topic subSections="3">Cocoon 2.0 and Beyond</JavaXML:Topic>
      <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
  </JavaXML:Contents>
  <JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>
</JavaXML:Book>

JavaXML:Book element. These initial lines,
excluding the JavaXML:Book element, make up the
document header. The term "header" is not a formal term
defined in the XML specification, but is commonly used in the XML
community, and we will use it in this book to denote these initial
lines of an XML document.

xml are intended for the XML parser itself. They
specify the version of XML being used, a stylesheet, or other
information that a parser may need to know to properly parse XML
data. Here is an XML instruction:

<?xml version="1.0" standalone="no"?>

<?target instruction?>, and in this case it specifies
that XML Version 1.0 is being used and that the document is not a
standalone XML document. Notice that the
instruction is not necessarily a single
keyword=value pair; in this case, both the version and whether the
document needs to be paired with an external document or documents
are specified. By specifying that it is not a standalone document, a
parser knows that an external DTD must be used to determine if the
XML document is valid. If this were set to

<JavaXML:Book>:

<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
  <!-- Content of XML Document -->
</JavaXML:Book>
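A namespace declaration like this one changes what a namespace-aware parser reports for each element. The following is a minimal sketch, assuming the SAX parser bundled with the JDK with namespace awareness switched on through SAXParserFactory; the class NamespaceDemo and the method describeRoot are hypothetical names used only for illustration. It shows the root element being reported as a namespace URI plus a local name rather than as the single string JavaXML:Book.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's example code.
public class NamespaceDemo {

    static final String XML =
        "<JavaXML:Book xmlns:JavaXML=\"http://www.oreilly.com/catalog/javaxml/\">"
        + "<JavaXML:Title>Java and XML</JavaXML:Title>"
        + "</JavaXML:Book>";

    /** Returns the root element as "{namespace URI}local name". */
    static String describeRoot(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Assumption: namespace processing is requested explicitly here.
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        final String[] root = new String[1];
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Only remember the first element reported, which is the root.
                if (root[0] == null) {
                    root[0] = "{" + uri + "}" + localName;
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeRoot(XML));
    }
}
```

Because the URI, not the prefix, identifies the namespace, two documents that use different prefixes for the same URI would produce the same result here.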
JavaXML.
In our XML example, it may be necessary later to include portions of
other O'Reilly books. Because each of these books may also have
<Chapter>,
<Heading>, or
<Topic> tags, the document must be designed
and constructed in a way to avoid namespace collision problems with
other documents. The XML namespaces specification nicely solves this
problem. Because our XML document represents a specific book, and no
other XML document should represent the same book, using a prefix
like

http://xml.apache.org, this C- and Java-based
parser is already one of the most widely contributed-to parsers
available. In addition, using an open source parser such as Xerces
allows you to send questions or bug reports to the parser's
authors, resulting in a better product, as well as helping you use
the software quickly and correctly. To subscribe to the general list
and request help on the Xerces parser, send a blank email to
org.xml.sax.XMLReader
interface. This interface defines parsing behavior and allows us to
set features and properties, which we will look at in Chapter 5. For those of you familiar with SAX 1.0, this
interface replaces the org.xml.sax.Parser
interface.

org.apache.xerces.parsers.SAXParser, implements
the org.xml.sax.XMLReader interface. If you have
access to the source of your parser, you should see the same
interface implemented in your parser's main SAX parser class.
Each XML parser must have one class (sometimes more!) that implements
this interface, and that is the class we need to instantiate to allow
us to parse XML:

XMLReader parser = new SAXParser( );
// Do something with the parser
parser.parse(uri);
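For readers without Xerces on the classpath, the same two steps can be tried with the parser that ships with the JDK. This is a minimal sketch under that assumption: the reader is obtained through javax.xml.parsers.SAXParserFactory instead of a vendor class, and ReaderDemo, newReader, and isWellFormed are hypothetical names. It relies only on the fact that parse( ) throws a SAXException when the input is not well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical demo class; not part of the book's example code.
public class ReaderDemo {

    /** Obtains an XMLReader from the JDK's bundled parser (an assumption). */
    static XMLReader newReader()
            throws SAXException, ParserConfigurationException {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Returns true if the parser gets through the input without a SAXException. */
    static boolean isWellFormed(String xml)
            throws IOException, ParserConfigurationException {
        try {
            XMLReader parser = newReader();
            // Do something with the parser
            parser.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser gave up; the input is not well-formed XML.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<a><b/></a>"));
        System.out.println(isWellFormed("<a><b></a>"));
    }
}
```

No handlers are registered in this sketch, so the only visible result is whether parsing completed; the bundled parser may also print its own diagnostic for the second input.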
reader
or XMLReader. While that would be a normal
convention, the SAX 1.0 classes defined the main parsing interface as
Parser, and a lot of legacy code has variables
named parser because of that naming. This
interface was deprecated because of the large number of changes
required for namespace and feature and properties support, but the
naming convention is still a good one, as parser
does indicate the purpose of the instance variable.

org.xml.sax.ContentHandler,
org.xml.sax.ErrorHandler,
org.xml.sax.DTDHandler, and
org.xml.sax.EntityResolver. In this chapter, we
discuss ContentHandler, which allows standard
data-related events within an XML document to be handled, and take a
first look at ErrorHandler, which receives
notifications from the parser when errors in the XML data are found.
DTDHandler will be examined in Chapter 5. We briefly discuss
EntityResolver at various points in the text; it
is enough for now to understand that
EntityResolver works just like the other handlers,
and is built specifically for resolving external entities specified
within an XML document. Custom application classes that perform
specific actions within the parsing process can implement each of
these interfaces. These implementation classes can be registered with
the parser with the methods setContentHandler( ),
setErrorHandler( ), setDTDHandler( ), and
setEntityResolver( ). Then the
parser invokes the callback methods on the appropriate handlers
during parsing.

ContentHandler interface. This interface defines
several important methods within the parsing lifecycle that our
application can react to. First we need to add the appropriate
import
statements to our source file
(including the

ContentHandler interface for handling parsing
events, SAX provides an ErrorHandler interface that can be
implemented to treat various error conditions that may arise during
parsing. This interface works in the same manner as the content handler
we have already constructed, but defines only three callback methods:
warning( ), error( ), and fatalError( ).
Through these three methods, all possible error conditions are
handled and reported by SAX parsers.

SAXParseException.
This object holds the line number on which the problem was encountered,
the URI of the document being processed, which could be the parsed
document or an external reference within that document, and normal
exception details such as a message and a printable stack trace. In
addition, each method can throw a SAXException.
This may seem a bit odd at first; an exception handler that throws an
exception? Keep in mind that what each handler receives is a parsing
exception. This can be a warning that should not cause the parsing
process to stop or an error that needs to be resolved for parsing to
continue; however, the callback may need to perform system I/O or
another operation that can throw an exception, and it needs to be
able to bubble this exception up the application chain. It can do
this through the SAXException the method is
allowed to throw.

XMLReader:

try {
    // Instantiate a parser
    XMLReader parser =
        new SAXParser( );
    // Register the content handler
    parser.setContentHandler(contentHandler);
    // Register the error handler
    parser.setErrorHandler(errorHandler);
    // Parse the document
    parser.parse(uri);
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage( ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage( ));
}
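The fragment above leaves contentHandler and errorHandler undefined. As a minimal, self-contained sketch of what such handlers could look like, the following records content events and counts the three kinds of error callbacks. It assumes the JDK's bundled parser (via SAXParserFactory) in place of Xerces, and every class and method name in it (HandlerDemo, RecordingContentHandler, CountingErrorHandler, eventsFor, fatalErrorsFor) is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's example code.
public class HandlerDemo {

    /** Records content events in the order the parser reports them. */
    static class RecordingContentHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }
        @Override public void startElement(String uri, String localName,
                                           String qName, Attributes atts) {
            events.add("start:" + qName);
        }
        @Override public void endElement(String uri, String localName,
                                         String qName) {
            events.add("end:" + qName);
        }
    }

    /** Implements the three ErrorHandler callbacks and counts each kind. */
    static class CountingErrorHandler implements ErrorHandler {
        int warnings, errors, fatalErrors;
        @Override public void warning(SAXParseException e) { warnings++; }
        @Override public void error(SAXParseException e) { errors++; }
        @Override public void fatalError(SAXParseException e) throws SAXException {
            fatalErrors++;
            throw e; // the handler itself bubbles the problem up as a SAXException
        }
    }

    static void parse(String xml, ContentHandler contentHandler,
                      ErrorHandler errorHandler) throws Exception {
        // Assumption: the JDK's bundled parser stands in for Xerces.
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(contentHandler);
        parser.setErrorHandler(errorHandler);
        parser.parse(new InputSource(new StringReader(xml)));
    }

    /** Parses well-formed input and returns the recorded content events. */
    static List<String> eventsFor(String xml) throws Exception {
        RecordingContentHandler contentHandler = new RecordingContentHandler();
        parse(xml, contentHandler, new CountingErrorHandler());
        return contentHandler.events;
    }

    /** Parses input and returns how many times fatalError( ) was invoked. */
    static int fatalErrorsFor(String xml) throws Exception {
        CountingErrorHandler errorHandler = new CountingErrorHandler();
        try {
            parse(xml, new RecordingContentHandler(), errorHandler);
        } catch (SAXException e) {
            // Expected for input that is not well-formed.
        }
        return errorHandler.fatalErrors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(eventsFor("<a><b/></a>"));
        System.out.println(fatalErrorsFor("<a><b></a>"));
    }
}
```

Rethrowing from fatalError( ) is one possible policy; a handler could equally log the problem and let the parser decide how to proceed.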
// Import your vendor's XMLReader implementation here
import org.apache.xerces.parsers.SAXParser;
XMLReader implementation, and then instantiate
that implementation directly. The problem here is not the difficulty
of this task, but that we have broken one of Java's biggest
tenets: portability. Our code cannot run or even be compiled on a
platform that does not use the Apache Xerces parser. In fact, it is
conceivable that an updated version of Xerces might even change the
name of the class used here! Our "portable" Java code is
no longer very portable.

String parameter to be changed in your source
code. Luckily, this facility is available in SAX 2.0. The
org.xml.sax.helpers.XMLReaderFactory class
provides the method you should be looking for:

/**
 * Attempt to create an XML reader from a class name.
 *
 * <p>Given a class name, this method attempts to load
 * and instantiate the class as an XML reader.</p>
 *
 * @return A new XML reader.
 * @exception org.xml.sax.SAXException If the class cannot be
 *            loaded, instantiated, and cast to XMLReader.
 * @see #createXMLReader( )
 */
public static XMLReader createXMLReader (String className)
    throws SAXException {
    // Implementation
}

org.xml.sax.helpers.ParserAdapter, which can actually cause a SAX 1.0
Parser implementation to behave like a SAX 2.0
XMLReader implementation. This handy class takes
a 1.0 Parser implementation as a constructor
argument and can then be used in place of that implementation.
It allows a ContentHandler to be set, and handles
all namespace callbacks properly. The only feature loss you will see
is that skipped entities will not be reported, as
this capability was not available in a 1.0 implementation in any
form, and cannot be emulated by a 2.0 adapter class. The sample class
would be used as shown in Example 3.6.

try {
    // Register a parser with SAX
    Parser parser =
        ParserFactory.makeParser(
            "org.apache.xerces.parsers.SAXParser");
    ParserAdapter myParser = new ParserAdapter(parser);
    // Register the content handler
    myParser.setContentHandler(contentHandler);
    // Register the error handler
    myParser.setErrorHandler(errHandler);
    // Parse the document
    myParser.parse(uri);
} catch (ClassNotFoundException e) {
    System.out.println(
        "The parser class could not be found.");
} catch (IllegalAccessException e) {
    System.out.println(
        "Insufficient privileges to load the parser class.");
} catch (InstantiationException e) {
    System.out.println(
        "The parser class could not be instantiated.");
} catch (ClassCastException e) {
    System.out.println(
        "The parser does not implement org.xml.sax.Parser");
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage( ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage( ));
}

<?xml version="1.0"?>
<?xml-stylesheet href="XSL\JavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSL\JavaXML.wml.xsl" type="text/xsl"
                 media="wap"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE JavaXML:Book SYSTEM "DTD\JavaXML.dtd">
<!-- Java and XML -->
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
  <JavaXML:Title>Java and XML</JavaXML:Title>
  <JavaXML:Contents>
    <JavaXML:Chapter focus="XML">
      <JavaXML:Heading>Introduction</JavaXML:Heading>
      <JavaXML:Topic subSections="7">What Is It?</JavaXML:Topic>
      <JavaXML:Topic subSections="3">How Do I Use It?</JavaXML:Topic>
      <JavaXML:Topic subSections="4">Why Should I Use It?</JavaXML:Topic>
      <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
    <JavaXML:Chapter focus="XML">
      <JavaXML:Heading>Creating XML</JavaXML:Heading>
      <JavaXML:Topic subSections="0">An XML Document</JavaXML:Topic>
      <JavaXML:Topic subSections="2">The Header</JavaXML:Topic>
      <JavaXML:Topic subSections="6">The Content</JavaXML:Topic>
      <JavaXML:Topic subSections="1">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
    <JavaXML:Chapter focus="Java">
      <JavaXML:Heading>Parsing XML</JavaXML:Heading>
      <JavaXML:Topic subSections="3">Getting Prepared</JavaXML:Topic>
      <JavaXML:Topic subSections="3">SAX Readers</JavaXML:Topic>
      <JavaXML:Topic subSections="9">Content Handlers</JavaXML:Topic>
      <JavaXML:Topic subSections="4">Error Handlers</JavaXML:Topic>
      <JavaXML:Topic subSections="0">
        A Better Way to Load a Parser
      </JavaXML:Topic>
      <JavaXML:Topic subSections="4">"Gotcha!"</JavaXML:Topic>
      <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
    <JavaXML:SectionBreak/>
    <JavaXML:Chapter focus="Java">
      <JavaXML:Heading>Web Publishing Frameworks</JavaXML:Heading>
      <JavaXML:Topic subSections="4">Selecting a Framework</JavaXML:Topic>
      <JavaXML:Topic subSections="4">Installation</JavaXML:Topic>
      <JavaXML:Topic subSections="3">
        Using a Publishing Framework
      </JavaXML:Topic>
      <JavaXML:Topic subSections="2">XSP</JavaXML:Topic>
      <JavaXML:Topic subSections="3">Cocoon 2.0 and Beyond</JavaXML:Topic>
      <JavaXML:Topic subSections="0">What's Next?</JavaXML:Topic>
    </JavaXML:Chapter>
  </JavaXML:Contents>
  <JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>
</JavaXML:Book>

http://www.w3.org/TR/xmlschema-1/
and http://www.w3.org/TR/xmlschema-2/. You should also be
aware that many XML parsers do not
support XML Schema, or support only portions of the specification.
You should check with your vendor to verify the level of XML Schema
support provided by your XML parser.

DOCTYPE declaration
is considered a valid XML document. This has caused quite a bit of
confusion in the XML community as to how to handle schema validation.
In addition to the difference in terms of validity, an XML 1.0 parser
or application does not have to perform schema validation, again
because XML Schema is not in the 1.0 specification of XML. This means
that even if your document has a schema reference, the document may
not be validated against that schema, regardless of the
parser's level of schema support. For these reasons, you should
take care to determine when your
parser will and will not validate, and
specifically how it handles schema validation. For clarity, we will
continue to use validity as the single term, representing either
schema or DTD validity. It will be up to you to see whether a
XMLReader
interface, the methods for setting
document and schema validation, namespace support, and other core
features are not standard across parser implementations. To address
this, SAX 2.0 defines a standard mechanism for setting important
properties and features of a parser that allows the addition of new
properties and features as they are accepted by the W3C without the
use of proprietary extensions or methods.

XMLReader
interface. This means we have to change little of our existing code
to request validation, set the namespace separator, and handle other
feature and property requests. The methods used for these purposes
are outlined in Table 5.1.

| Method | Returns | Parameters | Syntax |
|---|---|---|---|
| setProperty( ) | void | String propertyID, Object value | parser.setProperty("[Property URI]", "[Object parameter]"); |
| setFeature( ) |
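As a minimal sketch of how a feature request might be wrapped in practice, the following helper tries to set a feature and reports whether the parser accepted it. It assumes the JDK's bundled parser obtained through SAXParserFactory; the class FeatureDemo, its constants, and the method trySetFeature are hypothetical names, and the example.org feature URI is deliberately made up to show a rejected request. The calls setFeature( ) and getFeature( ), and the two exception types, are part of the standard XMLReader interface.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical demo class; not part of the book's example code.
public class FeatureDemo {

    // Standard SAX 2.0 feature URIs.
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    /** Obtains an XMLReader from the JDK's bundled parser (an assumption). */
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Tries to set a feature; returns false if the parser rejects the request. */
    static boolean trySetFeature(XMLReader parser, String featureURI,
                                 boolean state) {
        try {
            parser.setFeature(featureURI, state);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The parser does not know the feature, or cannot change it now.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = newReader();
        System.out.println(trySetFeature(parser, VALIDATION, true));
        System.out.println(parser.getFeature(VALIDATION));
        // A made-up URI, used only to show a request the parser turns down.
        System.out.println(trySetFeature(
            parser, "http://example.org/features/no-such-feature", true));
    }
}
```

Checking the result, or reading the value back with getFeature( ), is a cheap way to find out what a given parser actually supports before relying on it.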
D:\prod\JavaXML> java SAXParserDemo D:\prod\JavaXML\contents\contents.xml
Parsing XML File: D:\prod\JavaXML\contents\contents.xml
* setDocumentLocator( ) called
Parsing begins...
**Parsing Error**
Line: 13
URI: file:/D:/prod/JavaXML/contents/contents.xml
Message: Document root element "JavaXML:Book", must match DOCTYPE root
"JavaXML:Book".
DOCTYPE declaration
(JavaXML:Book) does not match the root element of
the document itself. But the root element is
JavaXML:Book, right? Actually, it's not! By default,
SAX 2.0 specifies that parsers must enable
their namespace feature, making all SAX 2.0 parsers namespace-aware
unless this feature is explicitly set to false. We did not change
this default, so our
XMLReader implementation is namespace aware. The
unexpected result of this is that our root element is seen (by the
parser) as Book, with the namespace prefix of
JavaXML. But remember that XML 1.0 and DTDs cannot
distinguish between a prefix and element name, so the root element
the DTD expects to find is JavaXML:Book. When it
finds Book, it reports the error above.

SAXParserDemo source file:

try {
    // Instantiate a parser
    XMLReader parser =
        XMLReaderFactory.createXMLReader(
            "org.apache.xerces.parsers.SAXParser");
    // Register the content handler
    parser.setContentHandler(contentHandler);
    // Register the error handler
    parser.setErrorHandler(errorHandler);
    // Turn on validation
    parser.setFeature("http://xml.org/sax/features/validation",
                      true);
ContentHandler and ErrorHandler
interfaces that we looked at in Chapter 3,
defining two callback methods that occur during the parsing process.

DTDHandler
interface unless you are writing an XML editor or IDE and need to
build or process DTD documents for correct syntax and notation. We
will look at the two callback methods provided by SAX here, but will
not spend much time on their use, as they are not significant in our
use of XML for non-editor type applications. For information on an
optional SAX handler that can help in reading further DTD
information, refer to the DeclHandler interface in
Appendix A, under the
org.xml.sax.ext package.

unparsedEntityDecl( ), is invoked when a DTD has an entity
declaration signifying that the XML parser should not parse a
particular entity. Though we have not looked at an example of this,
unparsed entities are common in XML documents that reference
images or other binary data, such as media files. This method takes in
the name of the entity, the public and system IDs, and the notation
name of the entity. Notation names are another
XML term we have not yet looked at. Consider the example of an XML
document fragment that refers to an image, possibly representing a
logo, shown in Example 5.9.

DTDHandler implementation with
the XML parser. Often, time and effort are spent to implement the
DTDHandler interface and register it with the
parser, while no time is spent setting the validation feature of the
parser. This error arises from a
mistaken association between handling a DTD and actually using the
DTD for validation. In this case, the DTD would be parsed, and all
DTD callback events would occur (if any were needed). However, the
XML document itself would not be validated, but simply parsed. Keep
in mind that the output from validating a valid XML document looks
almost identical to the output from parsing that document without
validation; always
be aware when validation is occurring to avoid application bugs:

try {
    // Instantiate a parser
    XMLReader parser =
        XMLReaderFactory.createXMLReader(
            "org.apache.xerces.parsers.SAXParser");
    // Register the content handler
    parser.setContentHandler(contentHandler);
    // Register the error handler
    parser.setErrorHandler(errorHandler);
    // This has no effect on turning on validation!
    parser.setDTDHandler(dtdHandler);
    // Turn on validation
    parser.setFeature("http://xml.org/sax/features/validation", true);
    // Turn off namespace awareness
    parser.setFeature("http://xml.org/sax/features/namespaces", false);
    // Parse the document
    parser.parse(uri);
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage( ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage( ));
}
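The difference between registering a DTDHandler and turning on validation can be observed directly. The following is a minimal sketch, assuming the JDK's bundled parser and a small document with an internal DTD invented for this illustration; ValidationDemo and countValidityErrors are hypothetical names. The same handler is registered as the DTDHandler in both runs, and only the validation feature is varied.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's example code.
public class ValidationDemo {

    // Well-formed, but invalid: the DTD does not declare or allow <extra/>.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE book [\n"
        + "  <!ELEMENT book (title)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "]>\n"
        + "<book><title>Java and XML</title><extra/></book>";

    /** Parses the document and returns how many times error( ) was invoked. */
    static int countValidityErrors(String xml, boolean validate)
            throws Exception {
        final int[] errors = {0};
        // DefaultHandler implements ContentHandler, ErrorHandler, and DTDHandler.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++; // validity problems arrive here as recoverable errors
            }
        };
        // Assumption: the JDK's bundled parser stands in for Xerces.
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(handler);
        parser.setErrorHandler(handler);
        // Registered in both runs; this alone does not enable validation.
        parser.setDTDHandler(handler);
        parser.setFeature("http://xml.org/sax/features/validation", validate);
        parser.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Without validation: "
            + countValidityErrors(XML, false));
        System.out.println("With validation: "
            + countValidityErrors(XML, true));
    }
}
```

With validation off the invalid element passes silently, which is exactly the kind of application bug the warning above is about.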