The last chapter was a good introduction to SAX. However, there are several more topics that will round out your knowledge of SAX. While I’ve called this chapter “Advanced SAX,” don’t be intimidated. It could just as easily be called “Less-Used Portions of SAX that are Still Important.” In writing these two chapters, I followed the 80/20 principle. 80% of you will probably never need to use the material in this chapter, and Chapter 3 will completely cover your needs. However, for those power users out there working in XML day in and day out, this chapter covers some of the finer points of SAX that you’ll need.
I’ll start with a look at setting parser properties and features, and discuss configuring your parser to do whatever you need it to. From there, I’ll move on to some more handlers: the EntityResolver and DTDHandler left over from the last chapter. At that point, you should have a comprehensive understanding of the standard SAX 2.0 distribution. However, we’ll push on to look at some SAX extensions, beginning with the writers that can be coupled with SAX, as well as some filtering mechanisms. Finally, I’ll introduce some new handlers to you, the LexicalHandler and DeclHandler, and show you how they are used. When all is said and done (including another “Gotcha!” section), you should be ready to take on the world with just your parser and the SAX classes. So slip into your shiny spacesuit and grab the flightstick... ahem. Well, I got carried away with the taking on the world. In any case, let’s get down to it.
With the wealth of XML-related
specifications and technologies emerging from the World Wide Web
Consortium (W3C), adding support for any new feature or property of
an XML parser has become difficult. Many parser implementations have
added proprietary extensions or methods at the cost of code
portability. While these software packages may implement the SAX
XMLReader
interface, the methods for setting
document and schema validation, namespace support, and other core
features are not standard across parser implementations. To address
this, SAX 2.0 defines a
standard mechanism for setting important properties and features of a
parser that allows the addition of new properties and features as
they are accepted by the W3C without the use of proprietary
extensions or methods.
Lucky for you and me, SAX 2.0 includes the methods needed for setting
properties and features in the
XMLReader
interface. This means you have to change little of your existing code
to request validation, set the namespace separator, and handle other
feature and property requests. The methods used for these purposes
are outlined in Table 4-1.
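Table 4-1. Property and feature methods
Method | Returns | Parameters | Syntax |
---|---|---|---|
setProperty( ) | void | String name, Object value | reader.setProperty("[Property URI]", propertyValue); |
setFeature( ) | void | String name, boolean value | reader.setFeature("[Feature URI]", featureState); |
getProperty( ) | Object | String name | Object propertyValue = reader.getProperty("[Property URI]"); |
getFeature( ) | boolean | String name | boolean featureState = reader.getFeature("[Feature URI]"); |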
For these methods, the ID of a specific property or feature is a URI. The core set of features and properties is listed in Appendix B. Additional documentation on features and properties supported by your vendor’s XML parser should also be available. These URIs are similar to namespace URIs; they are only used as associations for particular features. Good parsers ensure that you do not need network access to resolve these features; think of them as simple constants that happen to be in URI form. When you invoke one of these methods, the parser resolves the URI locally, typically treating it as a constant that identifies which action to take.
Warning
Don’t type these property and feature URIs into a browser to “check for their existence.” Often, this results in a 404 Not Found error. I’ve had many browsers report this to me, insisting that the URIs are invalid. However, this is not the case; the URI is just an identifier, and as I pointed out, usually resolved locally. Trust me: just use the URI, and trust the parser to do the right thing.
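As a rough illustration of these methods in use, the sketch below sets a feature by its URI and then reads it back. It assumes the SAX parser bundled with the JDK, obtained through JAXP's SAXParserFactory rather than the vendorParserClass used elsewhere in this chapter; the class name FeatureRoundTrip is made up for the example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class FeatureRoundTrip {

    // The URI is only an identifier; nothing is fetched over the network
    static final String VALIDATION =
        "http://xml.org/sax/features/validation";

    // Sets the validation feature, then returns the value the parser reports
    static boolean setAndReadValidation(boolean state) throws Exception {
        // Assumption: the JDK's bundled parser, obtained through JAXP
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        reader.setFeature(VALIDATION, state);
        return reader.getFeature(VALIDATION);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("validation on:  " + setAndReadValidation(true));
        System.out.println("validation off: " + setAndReadValidation(false));
    }
}
```

The same pattern should apply to setProperty( ) and getProperty( ), with an Object in place of the boolean.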
In the parser configuration context, a property requires some object value to be usable. For example, for lexical handling, a LexicalHandler implementation would be supplied as the value for the appropriate property. In contrast, a feature is a flag used by the parser to indicate whether a certain type of processing should occur. Common features are validation, namespace support, and including external parameter entities.
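To make the distinction concrete, here is a minimal sketch that sets one feature (a boolean flag) and one property (an object value) on the same reader. It assumes the JDK's bundled parser obtained through JAXP, and it uses the standard lexical-handler property with DefaultHandler2 from org.xml.sax.ext as a convenient LexicalHandler; the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class PropertyVersusFeature {

    // Parses the XML text and returns the comments seen by the LexicalHandler
    static List<String> collectComments(String xml) throws Exception {
        final List<String> comments = new ArrayList<>();

        // DefaultHandler2 implements LexicalHandler with do-nothing methods
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                comments.add(new String(ch, start, length).trim());
            }
        };

        // Assumption: the JDK's bundled parser, obtained through JAXP
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // A feature is a flag: it takes true or false
        reader.setFeature("http://xml.org/sax/features/validation", false);

        // A property needs an object value to be usable
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);

        reader.parse(new InputSource(new StringReader(xml)));
        return comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            collectComments("<book><!-- Java and XML Contents --></book>"));
    }
}
```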
The most convenient aspect of these methods is that they allow simple addition and modification of features. Although new or updated features will require a parser implementation to add supporting code, the method by which features and properties are accessed remains standard and simple; only a new URI need be defined. Regardless of the complexity (or obscurity) of new XML-related ideas, this robust set of four methods should be sufficient to allow parsers to implement the new ideas.
More often than not, the features and properties you deal with are the standard SAX-defined ones. These should be available with any SAX distribution, and any SAX-compliant parser should support them. Sticking to them also preserves vendor independence in your code, so I recommend that you use SAX-defined properties and features whenever possible.
The most common feature you’ll use is the validation feature. The URI for this guy is http://xml.org/sax/features/validation, and not surprisingly, it turns validation on or off in the parser. For example, if you want to turn on validation in the parsing example from the last chapter (remember the Swing viewer?), make this change in the SAXTreeViewer.java source file:
public void buildTree(DefaultTreeModel treeModel,
                      DefaultMutableTreeNode base, String xmlURI)
    throws IOException, SAXException {

    // Create instances needed for parsing
    XMLReader reader =
        XMLReaderFactory.createXMLReader(vendorParserClass);
    ContentHandler jTreeContentHandler =
        new JTreeContentHandler(treeModel, base);
    ErrorHandler jTreeErrorHandler = new JTreeErrorHandler( );

    // Register content handler
    reader.setContentHandler(jTreeContentHandler);

    // Register error handler
    reader.setErrorHandler(jTreeErrorHandler);

    // Request validation
    reader.setFeature("http://xml.org/sax/features/validation", true);

    // Parse
    InputSource inputSource = new InputSource(xmlURI);
    reader.parse(inputSource);
}
Compile these changes, and run the example program. Nothing happens,
right? Not surprising; the XML we’ve looked at so far is all
valid with respect to the DTD supplied. However, it’s easy
enough to fix that. Make the following change to your XML file
(notice that the element in the DOCTYPE
declaration no longer matches the actual root element, since XML is
case-sensitive):
<?xml version="1.0"?>
<!DOCTYPE Book SYSTEM "DTD/JavaXML.dtd">
<!-- Java and XML Contents -->
<book xmlns="http://www.oreilly.com/javaxml2"
xmlns:ora="http://www.oreilly.com"
>
Now run your program on this modified document. Because validation is turned on, you should get an ugly stack trace reporting the error. Of course, because that’s all that our error handler methods do, this is precisely what we want:
C:\javaxml2\build>java javaxml2.SAXTreeViewer
c:\javaxml2\ch04\xml\contents.xml
**Parsing Error**
Line: 7
URI: file:///c:/javaxml2/ch04/xml/contents.xml
Message: Document root element "book", must match DOCTYPE root "Book".
org.xml.sax.SAXException: Error encountered
at javaxml2.JTreeErrorHandler.error(SAXTreeViewer.java:445)
[Nasty Stack Trace to Follow...]
Remember, turning validation on or off does not affect DTD
processing; I talked about this in the last chapter, and wanted to
remind you of this subtle fact. To get a better sense of this, turn
off validation (comment out the feature setting, or supply it the
“false” value), and run the program on the modified XML.
Even though the DTD is processed, as seen by the resolved
OReillyCopyright
entity reference, no errors occur. That’s the
difference between processing a
DTD and
validating an XML document against that DTD.
Memorize, understand, and recite this to yourself; it will save you
hours of confusion in the long run.
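If you want to see the difference without the Swing viewer, the following sketch parses a small in-memory document twice, once with validation off and once with it on. The document, its internal DTD subset, and the Recorder class are all invented for the illustration, and the parser is assumed to be the one bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DtdVersusValidation {

    // Hypothetical document: the DOCTYPE names "Book", the root is "book"
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE Book [\n"
        + "  <!ELEMENT Book (#PCDATA)>\n"
        + "  <!ENTITY OReillyCopyright \"Copyright O'Reilly\">\n"
        + "]>\n"
        + "<book>&OReillyCopyright;</book>";

    // Collects character data and counts validity errors instead of throwing
    static class Recorder extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int errors = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void error(SAXParseException e) {
            errors++;
        }
    }

    static Recorder parse(boolean validate) throws Exception {
        // Assumption: the JDK's bundled parser, obtained through JAXP
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Recorder recorder = new Recorder();
        reader.setContentHandler(recorder);
        reader.setErrorHandler(recorder);
        reader.setFeature("http://xml.org/sax/features/validation", validate);
        reader.parse(new InputSource(new StringReader(XML)));
        return recorder;
    }

    public static void main(String[] args) throws Exception {
        Recorder off = parse(false);
        Recorder on = parse(true);
        System.out.println("validation off: text=" + off.text
            + ", errors=" + off.errors);
        System.out.println("validation on:  errors=" + on.errors);
    }
}
```

With validation off, the entity declared in the DTD is still expanded and no errors are counted. With validation on, the mismatch between the DOCTYPE and the root element is reported to the error handler.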
Next to validation, you’ll most commonly deal with namespaces. There are two features related to namespaces: one that turns namespace processing on or off, and one that indicates whether namespace prefixes should be reported as attributes. The two are essentially tied together, and you should always “toggle” both, as shown in Table 4-2.
Table 4-2. Toggle values for namespace-related features
Value for namespace processing | Value for namespace prefix reporting |
---|---|
True | False |
False | True |
This should make sense: if namespace processing is on, the xmlns-style declarations on elements should not be exposed to your application as attributes, as they are only useful for namespace handling. However, if you do not want namespace processing to occur (or want to handle it on your own), you will want these xmlns declarations reported as attributes so you can use them just as you would use other attributes. If these two features fall out of sync (both are true, or both are false), you can end up with quite a mess!
Consider writing a small utility method to ensure these two features stay in sync with each other. I often use the method shown here for this purpose:
private void setNamespaceProcessing(XMLReader reader, boolean state)
    throws SAXNotSupportedException, SAXNotRecognizedException {

    reader.setFeature(
        "http://xml.org/sax/features/namespaces", state);
    reader.setFeature(
        "http://xml.org/sax/features/namespace-prefixes", !state);
}
This maintains the correct setting for both features, and you can now
simply call this method instead of two setFeature( )
invocations in your own code. Personally, I’ve used this
feature fewer than ten times in about two years; the default values
(processing namespaces as well as not reporting prefixes as
attributes) almost always work for me. Unless you are writing
low-level applications that either don’t need namespaces or can
use the speed increase obtained from not processing namespaces, or
you need to handle namespaces on your own, I wouldn’t worry too
much about either of these features.
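To see what the two settings mean for your handler, the sketch below counts the attributes reported on a single element that carries nothing but an xmlns declaration. The document and class name are hypothetical, and the parser is assumed to be the one bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceToggle {

    // Hypothetical one-element document carrying a single xmlns declaration
    static final String XML =
        "<ora:book xmlns:ora=\"http://www.oreilly.com\"/>";

    // Returns how many attributes the parser reports on the root element
    static int countReportedAttributes(boolean namespaceProcessing)
        throws Exception {

        final int[] count = new int[1];

        // Assumption: the JDK's bundled parser, obtained through JAXP
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Keep the two features in opposite states
        reader.setFeature(
            "http://xml.org/sax/features/namespaces", namespaceProcessing);
        reader.setFeature(
            "http://xml.org/sax/features/namespace-prefixes",
            !namespaceProcessing);

        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0] = atts.getLength();
            }
        });
        reader.parse(new InputSource(new StringReader(XML)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("processing on:  "
            + countReportedAttributes(true) + " attribute(s)");
        System.out.println("processing off: "
            + countReportedAttributes(false) + " attribute(s)");
    }
}
```

With namespace processing on, the xmlns:ora declaration is consumed by the parser and no attributes are reported; with it off, the declaration shows up as an ordinary attribute.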
This code brings up a rather important aspect of features and properties, though: invoking the feature and property methods can result in SAXNotSupportedExceptions and SAXNotRecognizedExceptions. These are both in the org.xml.sax package, and need to be imported in any SAX code that uses them. The first indicates that the parser knows about the feature or property but doesn’t support it. You won’t run into this much in even average-quality parsers, but it is commonly used when a standard property or feature is not yet coded in. So invoking setFeature( ) on the namespace processing feature on a parser in development might result in a SAXNotSupportedException. The parser recognizes the feature, but doesn’t have the ability to perform the requested processing. The second exception most commonly occurs when using vendor-specific features and properties (covered in the next section), and then switching parser implementations. The new implementation won’t know anything about the other vendor’s features or properties, and will throw a SAXNotRecognizedException.
You should always explicitly catch these exceptions so you can deal with them. Otherwise, you end up losing valuable information about what happened in your code. For example, let me show you a modified version of the code from the last chapter that tries to set up various features, and how that changes the exception-handling architecture:
public void buildTree(DefaultTreeModel treeModel,
                      DefaultMutableTreeNode base, String xmlURI)
    throws IOException, SAXException {

    String featureURI = "";

    try {
        // Create instances needed for parsing
        XMLReader reader =
            XMLReaderFactory.createXMLReader(vendorParserClass);
        ContentHandler jTreeContentHandler =
            new JTreeContentHandler(treeModel, base);
        ErrorHandler jTreeErrorHandler = new JTreeErrorHandler( );

        // Register content handler
        reader.setContentHandler(jTreeContentHandler);

        // Register error handler
        reader.setErrorHandler(jTreeErrorHandler);

        /** Deal with features **/
        featureURI = "http://xml.org/sax/features/validation";

        // Request validation
        reader.setFeature(featureURI, true);

        // Namespace processing - on
        featureURI = "http://xml.org/sax/features/namespaces";
        setNamespaceProcessing(reader, true);

        // Turn on String interning
        featureURI = "http://xml.org/sax/features/string-interning";
        reader.setFeature(featureURI, true);

        // Turn off schema processing
        featureURI =
            "http://apache.org/xml/features/validation/schema";
        reader.setFeature(featureURI, false);

        // Parse
        InputSource inputSource = new InputSource(xmlURI);
        reader.parse(inputSource);

    } catch (SAXNotRecognizedException e) {
        System.out.println("The parser class " + vendorParserClass +
            " does not recognize the feature URI " + featureURI);
        System.exit(0);
    } catch (SAXNotSupportedException e) {
        System.out.println("The parser class " + vendorParserClass +
            " does not support the feature URI " + featureURI);
        System.exit(0);
    }
}
By dealing with these exceptions as well as other special cases, you give the user better information and improve the quality of your code.
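If you would rather degrade gracefully than exit, one option is a small probing helper along these lines. This is only a sketch: the helper name, the returned strings, and the example.com feature URI are all made up, and the parser is assumed to be the one bundled with the JDK.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {

    // Tries to set a feature and reports the outcome as a short string
    static String trySetFeature(XMLReader reader, String featureURI,
                                boolean state) {
        try {
            reader.setFeature(featureURI, state);
            return "set";
        } catch (SAXNotRecognizedException e) {
            // The parser has never heard of this URI
            return "not recognized";
        } catch (SAXNotSupportedException e) {
            // The parser knows the URI but cannot honor the request
            return "not supported";
        }
    }

    // Assumption: the JDK's bundled parser, obtained through JAXP
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        System.out.println(trySetFeature(reader,
            "http://xml.org/sax/features/validation", true));
        // Hypothetical URI that no parser should know about
        System.out.println(trySetFeature(reader,
            "http://example.com/features/hypothetical", true));
    }
}
```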
The three
remaining SAX-defined features are fairly obscure. The first,
http://xml.org/sax/features/string-interning,
turns string interning on or off. By default this is false (off) in
most parsers. Setting it to true means that every element name,
attribute name, namespace URI and prefix, and other strings have
java.lang.String.intern( ) invoked on them.
I’m not going to get into great detail about interning here; if
you don’t know what it is, check out Sun’s
Javadoc on the method at
http://java.sun.com/j2se/1.3/docs/api/index.html.
In a nutshell, every time a string is encountered, Java attempts to
return an existing reference for the string in the current string
pool, instead of (possibly) creating a new String
object. Sounds like a good thing, right? Well, the reason it’s
off by default is most parsers have their own optimizations in place
that can outperform string interning. My advice is to leave this
setting alone; many people have spent weeks tuning things like this
so you don’t have to mess with them.
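For a quick feel for what interning does in plain Java, independent of any parser, consider this small sketch (the class and method names are invented for the example):

```java
class InternDemo {

    // True only when both references point at the very same String object
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "book";           // literals live in the string pool
        String built = new String("book"); // a distinct object, equal content

        System.out.println(sameObject(literal, built));          // false
        System.out.println(sameObject(literal, built.intern())); // true
    }
}
```

When a parser interns the names it reports, a handler could in principle compare them against literals with == instead of equals( ), which is where any speed benefit would come from.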
The other two features determine whether external general (textual) entities are included (http://xml.org/sax/features/external-general-entities), and whether external parameter entities, including the external DTD subset, are included (http://xml.org/sax/features/external-parameter-entities) when parsing occurs. These are set to true for most parsers, as they deal with all the entities that XML has to offer. Again, I recommend you leave these settings as is, unless you have a specific reason for disabling entity handling.
The two standard SAX properties are a
little less clear in their usage. In both cases, the properties are
more useful for obtaining values, whereas with
features the common use is to set values.
Additionally, both properties are more helpful in error handling than
in any general usage. And finally, both properties provide access to
what is being parsed at a given time. The first, identified by the
URI http://xml.org/sax/properties/dom-node,
returns the current DOM node being processed, or the root DOM node if
parsing isn’t occurring. Of course, I haven’t really
talked about DOM yet, but this will make more sense in the next two
chapters. The second property, identified by the URI http://xml.org/sax/properties/xml-string,
returns the literal string of characters being processed.
You’ll find varying support for these properties in various
parsers, showing that many parser implementers find these properties
of arguable use as well. For example, Xerces does not support the
xml-string
property, to avoid having to buffer the
input document (at least in that specific way). On the other hand, it
does support the dom-node
property so that you can
turn a SAX parser into (essentially) a DOM tree iterator.
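Because support varies so much, it can be worth asking your parser what it does before relying on either property. The sketch below does that with getProperty( ); the helper name and returned strings are invented, the parser is assumed to be the one bundled with the JDK, and the outcome is left unstated because it depends on the parser you run it against.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class PropertyProbe {

    // Asks the parser for a property and describes what came back
    static String probe(XMLReader reader, String propertyURI) {
        try {
            Object value = reader.getProperty(propertyURI);
            return "available (current value: " + value + ")";
        } catch (SAXNotRecognizedException e) {
            return "not recognized";
        } catch (SAXNotSupportedException e) {
            return "not supported";
        }
    }

    // Assumption: the JDK's bundled parser, obtained through JAXP
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        System.out.println("dom-node:   "
            + probe(reader, "http://xml.org/sax/properties/dom-node"));
        System.out.println("xml-string: "
            + probe(reader, "http://xml.org/sax/properties/xml-string"));
    }
}
```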
In addition to the standard, SAX-defined features and properties, most parsers define several features and properties of their own. For example, Apache Xerces has a page of features it supports at http://xml.apache.org/xerces-j/features.html, and properties it supports at http://xml.apache.org/xerces-j/properties.html. I’m not going to cover these in great detail, and you should steer clear of them whenever possible; they lock your code into a specific vendor. However, there are times when using a vendor’s specific functionality will save you some work. In those cases, exercise caution, but don’t be foolish; use what your parser gives you!
As an example, take the Xerces feature that enables and disables XML schema processing: http://apache.org/xml/features/validation/schema. Because there is no standard support for XML schemas across parsers or in SAX, you need this vendor-specific feature to control that processing. Setting it to false (it’s set to true by default) avoids spending parsing time on any XML schemas referenced in your documents, which saves time in production if you don’t need schema processing. Check out your vendor documentation for options available in addition to SAX’s.