SAX gives you the flexibility to approach application design with your own trade-offs and goals in mind. High-level APIs often make many of those trade-offs for you, but not necessarily in ways that are best for your problems. In particular, SAX lets you design lightweight, task-oriented XML solutions, which can fit into small systems or scale up to large ones. Just having such options can be an important reason to choose SAX over generic APIs that work only at a high level. While initial deployment platforms might be richly featured, this won’t necessarily be true for all the systems you need to support, or for the ones your customers want you to support.
Compared to other parser-level APIs, SAX has two unique structural features: its efficient event-stream processing model and its data structure flexibility. These give you more control over the results of your parse.
SAX is the API to use when you need to stream-process XML to conserve memory and, in most cases, CPU time. In SAX, handler interfaces call application (or library) code for each significant chunk of XML information as it’s parsed. These chunks include character data, elements, and attributes. Each event passes information to your code, which can save it or ignore it as appropriate. These handlers see document information as a stream of such event calls, in “document order.” Applications can process data incrementally, rather than in one big chunk, and they can discard information as soon as it’s not needed.
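As a minimal sketch of this event-stream model, the following handler (the class name and sample document are invented for illustration) receives element and character-data events in document order and simply counts and accumulates them as they stream by:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Count elements and collect character data as events arrive,
// without ever building an in-memory tree of the document.
public class EventStreamDemo extends DefaultHandler {
    int elementCount;
    final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        elementCount++;                   // one event per start tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);   // character data arrives in chunks
    }

    public static void main(String[] args) throws Exception {
        EventStreamDemo handler = new EventStreamDemo();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader("<a><b>hi</b><c>!</c></a>")),
            handler);
        System.out.println(handler.elementCount + " " + handler.text);
        // prints: 3 hi!
    }
}
```

Each callback can save the data, transform it, or ignore it; nothing is retained unless the handler chooses to retain it.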
SAX parsers have several key advantages:
SAX parsers can be small and fast because they are minimal. SAX provides the most essential XML data, and no more.
SAX parsers are well suited for use in resource-constrained environments. This includes not just small systems or classic embedded ones (where cost prevents the use of much memory or fast CPUs), but also servers such as security gateways (which may have huge amounts of memory and fast CPUs, but need good scaling properties to share them among many clients). Good security practice also avoids large bodies of code, since assurance is so hard to achieve.
Because SAX is a streaming API, it promotes pipelined processing, where I/O occurs while you use the CPU to do work. You will naturally structure applications (or at least their SAX components) to use efficient single-pass algorithms and incremental processing.
As soon as XML data starts to become available (perhaps over a network), SAX parsers start to provide it to applications. While processing element or character data, the network or the filesystem prefetches the next data. Such overlapped processing lowers latencies and makes good use of limited CPU cycles. With most other APIs, your application won’t even see data until the whole document has been fetched and parsed; you can’t process documents larger than available memory. This causes major trouble when you work with large documents, as discussed in the next section.
SAX gives you flexible control over how you handle errors and faults. Fatal errors aren’t the only kind of reportable fault, and diagnostic information is readily accessible.
You can easily provide application-specific error reports with the standard mechanism. It's also easy to terminate parsing early: just throw an appropriate exception when you find the great:widget element you need, or when some unrecoverable error turns up.
It's easy to define custom SAX event producers.
That is, you can use SAX when your inputs aren’t literal XML text. This is a powerful technique that helps you work with data at the level of parsed XML information (the XML “Infoset”), and postprocess SAX events or late-bind data into XML text format. Such early/late-binding flexibility is a powerful architectural tool.
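One way to sketch such a producer (the RowProducer class and its rows/row vocabulary are invented for this example) is a method that synthesizes SAX events from in-memory data and pushes them at any ContentHandler, with no XML text involved at all:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// A custom event producer: no XML text is parsed; we synthesize SAX
// events from an in-memory array. Any downstream consumer (a serializer,
// a filter chain) cannot tell these events from a real parse.
public class RowProducer {
    public static void emit(String[] rows, ContentHandler out)
            throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startDocument();
        out.startElement("", "rows", "rows", none);
        for (String row : rows) {
            out.startElement("", "row", "row", none);
            out.characters(row.toCharArray(), 0, row.length());
            out.endElement("", "row", "row");
        }
        out.endElement("", "rows", "rows");
        out.endDocument();
    }
}
```

Because the consumer only sees the standard ContentHandler callbacks, you can bind such data to XML text late (by attaching a serializer) or never (by attaching an application handler directly).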
You may be fortunate enough to be able to design the XML representations of your application tasks to facilitate such work-flow streams. When you do this, you may see substantial performance and scalability gains over alternative design approaches. You might even be able to pull the SAX event stream model up into higher-level work flows in your system so that more processing can be stream-based.
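A small sketch of such a stream-based work flow, using the standard XMLFilterImpl helper (the Pipeline class and the item/task renaming are hypothetical): a filter stage rewrites events as they pass, and a downstream consumer sees only the transformed stream, all in one pass over the input.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// A two-stage SAX pipeline: parse -> rename filter -> consumer.
public class Pipeline {
    static String run(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();

        // Stage 1: rename legacy "item" elements to "task" in flight.
        XMLFilterImpl rename = new XMLFilterImpl(parser) {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) throws SAXException {
                super.startElement(uri, local,
                        "item".equals(qName) ? "task" : qName, atts);
            }
            @Override
            public void endElement(String uri, String local, String qName)
                    throws SAXException {
                super.endElement(uri, local,
                        "item".equals(qName) ? "task" : qName);
            }
        };

        // Stage 2: record the element names the consumer actually sees.
        final StringBuilder seen = new StringBuilder();
        rename.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                seen.append(qName).append(' ');
            }
        });

        rename.parse(new InputSource(new StringReader(xml)));
        return seen.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<jobs><item/><item/></jobs>"));
        // prints: jobs task task
    }
}
```

Stages like these compose: each filter holds only per-event state, so the pipeline's memory use stays constant no matter how long the document is.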
For example, you could structure your XML as a sequential list of reasonably sized tasks. Several kinds of data import/export problems are well suited to this approach, although you may find you need to be aware of the I/O costs of random access as you transform data to and from interchange formats.
In contrast to higher-level APIs, or most design tools, SAX allows you to populate whatever data structures you choose. It lets you use custom data APIs, optimized for your application, or more general-purpose APIs. This flexibility operates at two broad system levels: architecture and design. Such flexibility is required to scale applications up (or down) and to update applications as systems evolve.
Application architecture components affect how systems interact with each other and with external systems. SAX doesn’t constrain these components, which include data interchange formats and messaging paradigms, because it lets you use XML in any way you (or your systems partners) need. In contrast, settling early on higher-level XML APIs will constrain application architectures in many ways, often affecting XML structures used for interoperability. For example, many SOAP toolkits expect an RPC paradigm using W3C-style XML schemas, and many data-binding approaches demand a particular schema system and API toolset. The hope is that if you accept those system constraints, you win more than they cost. When that doesn’t work, perhaps because the constraints don’t suit your application, you’ll appreciate the flexibility of SAX.
The design level affects application internals rather than the broader interfaces, which relate to architecture. Design constraints affect runtime and implementation costs. If you’re adding XML support to an existing system, design-level concerns may dominate your planning. SAX lets you use your current optimized data structures or define new ones. Since such design issues will often dominate performance measurements (given reasonable architectures), preserving flexibility can be very important.
With SAX, you don’t need to use generic (and largely untyped) data structures. You will normally store data directly into specialized data structures as SAX delivers it from its XML representation. This facilitates important architecture-level optimizations. Being able to use custom data structures means you can leverage the strong data-typing facilities in Java and detect many kinds of bugs early, while recovery is possible and cheap. Custom data decisions are the ideal way to work with large documents, for other cases where scale is a major concern, and anywhere that data structure decisions need to be driven by application issues rather than “one size fits all” generic tools.
To illustrate this design impact, we’ll pick on DOM as a representative design choice for an API with a generic XML data structure. You’ll often have reasons to use both SAX and DOM, even in the same application, so you’ll need to know when to use each API. The strength of DOM is that it’s a widely understood and available generic model; it can be good for “proof of concept” solutions. However, it has a high price in terms of flexibility and resource consumption. Later, we look at ways to reduce those DOM costs with help from SAX and ways that DOM and SAX representations of XML data can be interconverted.
For documents with a “typical” markup density, many DOM implementations in Java use about 10 bytes of memory to represent each byte of XML text. (Few take less, some take more.) Yes, that midsize three-megabyte document can easily balloon up to 30 megabytes of memory on your server![1] When using DOM with large documents, memory shortages are common, both for virtual memory and for space in the Java heap. Shortages are made worse if you then need to convert data from a generic DOM representation into custom structures, because you need an extra copy of the data while you build the more appropriate data structure. This clearly limits application scalability.
On the other hand, with SAX you don't pay for any memory unless you choose to do so. You can ignore most of that three-megabyte document right up front; the API structure makes it natural to capture only significant data (whatever that may be in your application). This reduces memory allocation pressure, as well as overhead from garbage collection. Best of all, SAX parsers let you use data structures that are appropriate for your application from the very beginning. In fact, they all but require you to do that!
SAX has always defined its concurrency behaviors, making it safe to use SAX in multithreaded applications. Since DOM does not specify those behaviors, multithreaded applications (such as most web services) accept implementation dependencies if they choose to use DOM.
SAX2 provides almost complete support for the XML Infoset, exposing the logical structure of XML data. (See Appendix B.) This means it's substantially more complete than most other XML APIs, and certainly more complete than any other widely available API. You are unlikely to need important information from an XML document that SAX can't provide. This contrasts with DOM, which doesn't have standard APIs to expose much of this information. SAX is a great way to turn a stream of such Infoset data into other kinds of data.
At its core, SAX is indeed a very simple API for XML processing; such simplicity is a key virtue. You can write useful XML applications code with only a handful of method calls and still know that the rest of the XML Infoset data is available when you need it. It’s not like DOM, in which syntax artifacts that mask the core data model of XML are common. DOM takes a more monolithic approach than SAX. A book that covers DOM as completely as this book covers SAX would need to be several times larger even if it didn’t cover the latest version (Level 3).
On top of that, because SAX makes you actually think about the best way to represent your data, it’s more fun to work with than tools that claim to solve those issues for you! (They usually can’t.) It’s also a great way to learn your way around XML and Java.
[1] Some applications certainly revolve around large documents. One translation of the Old Testament is over 3 megabytes in size; one dictionary is over 50 megabytes. Dumps of databases can be gigabytes in size.