Now that you’re all warmed up with parsers and have enough knowledge to make you slightly dangerous, we’ll analyze one of the two important styles of XML processing: event streams. We’ll look at some examples that show the basic theory of stream processing and graduate with a full treatment of the standard Simple API for XML (SAX).
In the world of computer science, a stream is a sequence of data chunks to be processed. A file, for example, is a sequence of characters (one or more bytes each, depending on the encoding). A program using this data can open a filehandle to the file, creating a character stream, and it can choose to read in data in chunks of whatever size it chooses. Streams can be dynamically generated too, whether from another program, received over a network, or typed in by a user. A stream is an abstraction, making the source of the data irrelevant for the purpose of processing.
To summarize, here are a stream’s important qualities:
It consists of a sequence of data fragments.
The order of fragments transmitted is significant.
The source of data (e.g., file or program output) is not important.
XML streams are just clumpy character streams. Each data clump, called a token in parser parlance, is a conglomeration of one or more characters. Each token corresponds to a type of markup, such as an element start or end tag, a string of character data, or a processing instruction. It’s very easy for parsers to dice up XML in this way, requiring minimal resources and time.
What makes XML streams different from character streams is that the context of each token matters; you can’t just pump out a stream of random tags and data and expect an XML processor to make sense of it. For example, a stream of ten start tags followed by no end tags is not very useful, and definitely not well-formed XML. Any data that isn’t well-formed will be rejected. After all, the whole purpose of XML is to package data in a way that guarantees the integrity of a document’s structure and labeling, right?
These contextual rules are helpful to the parser as well as the front-end processor. XML was designed to be very easy to parse, unlike other markup languages that can require look-ahead or look-behind. For example, SGML does not have a rule requiring nonempty elements to have an end tag. To know when an element ends requires sophisticated reasoning by the parser. This requirement leads to code complexity, slower processing speed, and increased memory usage.
Why do we call it an event stream and not an element stream or a markup object stream? The fact that XML is hierarchical (elements contain other elements) makes it impossible to package individual elements and serve them up as tokens in the stream. In a well-formed document, all elements are contained in one root element. A root element that contains the whole document is not a stream. Thus, we really can’t expect a stream to give a complete element in a token, unless it’s an empty element.
Instead, XML streams are composed of events. An event is a signal that the state of the document (as we’ve seen it so far in the stream) has changed. For example, when the parser comes across the start tag for an element, it indicates that another element was opened and the state of parsing has changed. An end tag affects the state by closing the most recently opened element. An XML processor can keep track of open elements in a stack data structure, pushing newly opened elements and popping off closed ones. At any given moment during parsing, the processor knows how deep it is in the document by the size of the stack.
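To make the stack idea concrete, here is a minimal sketch in plain Perl. It does not use a real parser; the `@events` list and the `trace_depth` routine are hypothetical stand-ins for whatever a parser would actually emit, chosen only to show how pushing and popping element names tracks the depth.

```perl
use strict;
use warnings;

# Hypothetical event list, standing in for what a parser would emit.
my @events = (
    [ start => 'recipe' ],
    [ start => 'name' ],
    [ text  => 'peanut butter and jelly sandwich' ],
    [ end   => 'name' ],
    [ end   => 'recipe' ],
);

# Walk the events, keeping open elements on a stack, and return one
# report line per event showing the nesting depth after that event.
sub trace_depth {
    my @stack;
    my @report;
    foreach my $event ( @_ ) {
        my( $type, $data ) = @$event;
        if( $type eq 'start' ) {
            push @stack, $data;                 # element opened
        }
        elsif( $type eq 'end' ) {
            my $open = pop @stack;              # most recently opened element
            die "mismatched end tag </$data>\n"
                unless defined $open and $open eq $data;
        }
        push @report, sprintf( "%-5s depth=%d", $type, scalar @stack );
    }
    die "unclosed element <$stack[-1]>\n" if @stack;
    return @report;
}

print "$_\n" foreach trace_depth( @events );
```

The same stack also catches the simplest well-formedness errors for free: an end tag that doesn’t match the top of the stack, or elements still open when the stream runs out.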
Though parsers support a variety of events, there is a lot of overlap. For example, one parser may distinguish between a start tag and an empty element, while another may not, but all will signal the presence of that element. Let’s look more closely at how a parser might dole out tokens for the fragment shown in Example 4-1.
Example 4-1. XML fragment
<recipe>
  <name>peanut butter and jelly sandwich</name>
  <!-- add picture of sandwich here -->
  <ingredients>
    <ingredient>Gloppy&trade; brand peanut butter</ingredient>
    <ingredient>bread</ingredient>
    <ingredient>jelly</ingredient>
  </ingredients>
  <instructions>
    <step>Spread peanutbutter on one slice of bread.</step>
    <step>Spread jelly on the other slice of bread.</step>
    <step>Put bread slices together, with peanut butter and jelly touching.</step>
  </instructions>
</recipe>
Apply a parser to the preceding example and it might generate this list of events:
A document start (if this is the beginning of a document and not a fragment)
A start tag for the <recipe> element
A start tag for the <name> element
The piece of text “peanut butter and jelly sandwich”
An end tag for the <name> element
A comment with the text “add picture of sandwich here”
A start tag for the <ingredients> element
A start tag for the <ingredient> element
The text “Gloppy”
A reference to the entity trade
The text “brand peanut butter”
An end tag for the <ingredient> element
. . . and so on, until the final event—the end of the document—is reached.
Somewhere between chopping up a stream into tokens and processing the tokens is a layer one might call a dispatcher. It branches the processing depending on the type of token. The code that deals with a particular token type is called a handler. There could be a handler for start tags, another for character data, and so on. It could be a compound if statement, switching to a subroutine to handle each case. Or, it could be built into the parser as a callback dispatcher, as is the case with XML::Parser’s stream mode. If you register a set of subroutines, one to an event type, the parser calls the appropriate one for each token as it’s generated. Which strategy you use depends on the parser.
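One common way to write a dispatcher by hand in Perl is a dispatch table: a hash that maps each token type to a subroutine reference. The sketch below is an illustration only. The token types (`start`, `end`, `text`, `comment`) and the `dispatch` routine are names we made up for the example, not part of any parser’s interface.

```perl
use strict;
use warnings;

my @log;    # collects what the handlers saw

# One handler per token type; the type names are illustrative.
my %handlers = (
    start => sub { push @log, "open <$_[0]>" },
    end   => sub { push @log, "close </$_[0]>" },
    text  => sub { push @log, "text '$_[0]'" },
);

# The dispatcher: look up the handler for the token type and call it.
# Token types with no registered handler are silently skipped.
sub dispatch {
    my( $type, $data ) = @_;
    my $handler = $handlers{ $type } or return;
    $handler->( $data );
}

dispatch( @$_ ) foreach (
    [ start   => 'name' ],
    [ text    => 'peanut butter and jelly sandwich' ],
    [ comment => 'add picture of sandwich here' ],   # no handler, ignored
    [ end     => 'name' ],
);
print "$_\n" foreach @log;
```

Compared with a compound if statement, the table keeps the branching logic in one line and lets you add or replace a handler without touching the dispatcher itself.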
You don’t have to write an XML processing program that separates parser from handler, but doing so can be advantageous. By making your program modular, you make it easier to organize and test your code. The ideal way to modularize is with objects, communicating on sanctioned channels and otherwise leaving one another alone. Modularization makes swapping one part for another easier, which is very important in XML processing.
The XML stream, as we said before, is an abstraction, which makes the source of data irrelevant. It’s like the spigot you have in the backyard, to which you can hook up a hose and water your lawn. It doesn’t matter where you plug it in, you just want the water. There’s nothing special about the hose either. As long as it doesn’t leak and it reaches where you want to go, you don’t care if it’s made of rubber or bark. Similarly, XML parsers have become a commodity: something you can download, plug in, and see it work as expected. Plugging it in easily, however, is the tricky part.
The key is the screwhead on the end of the spigot. It’s a standard gauge of pipe that uses a specific thread size, and any hose you buy should fit. With XML event streams, we also need a standard interface there. XML developers have settled on SAX, which has been in use for a few years now. Until recently, Perl XML parsers were not interchangeable. Each had its own interface, making it difficult to swap out one in favor of another. That’s changing now, as developers adopt SAX and agree on conventions for hooking up handlers to parsers. We’ll see some of the fruits of this effort in Chapter 5.
Stream processing is great for many XML tasks. Here are a few of them:
- Filter
A filter outputs an almost identical copy of the source document, with a few small changes. Every occurrence of an <A> element might be converted into a <B> element, for example. The handler is simple, as it has to output only what it receives, except to make a subtle change when it detects a specific event.
- Selector
If you want a specific piece of information from a document, without the rest of the content, you can write a selector program. This program combs through events, looking for an element or attribute containing a particular bit of unique data called a key, and then stops. The final job of the program is to output the sought-after record, possibly reformatted.
- Summarizer
This program type consumes a document and spits out a short summary. For example, an accounting program might calculate a final balance from many transaction records; a program might generate a table of contents by outputting the titles of sections; an index generator might create a list of links to certain keywords highlighted in the text. The handler for this kind of program has to remember portions of the document to repackage it after the parser is finished reading the file.
- Converter
This sophisticated type of program turns your XML-formatted document into another format—possibly another application of XML. For example, turning DocBook XML into HTML can be done in this way. This kind of processing pushes stream processing to its limits.
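As an illustration of the simplest of these program types, the filter, here is a rough sketch that renames <A> elements to <B> and passes everything else through. It works on a hand-built list of events rather than a real parser’s output; the `[ type, data ]` event format and the `filter_events` and `escape_text` routines are assumptions made for the example.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML character data.
sub escape_text {
    my $text = shift;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    return $text;
}

# Filter: echo each event as markup, renaming <A> elements to <B>.
# Events are hypothetical [ type, data ] pairs, not a real parser's API,
# and only the types start, end, and text are handled.
sub filter_events {
    my $out = '';
    foreach my $event ( @_ ) {
        my( $type, $data ) = @$event;
        $data = 'B' if $type ne 'text' and $data eq 'A';   # the one subtle change
        $out .= $type eq 'start' ? "<$data>"
              : $type eq 'end'   ? "</$data>"
              :                    escape_text( $data );
    }
    return $out;
}

print filter_events(
    [ start => 'doc' ], [ start => 'A' ], [ text => 'A & B' ],
    [ end => 'A' ], [ end => 'doc' ],
), "\n";
# prints: <doc><B>A &amp; B</B></doc>
```

Note that the handler needs no memory at all: each event is written out as soon as it arrives, which is what makes filters such a natural fit for streams.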
XML stream processing works well for a wide variety of tasks, but it does have limitations. The biggest problem is that everything is driven by the parser, and the parser has a mind of its own. Your program has to take what it gets in the order given. It can’t say, “Hold on, I need to look at the token you gave me ten steps back” or “Could you give me a sneak peek at a token twenty steps down the line?” You can look back into the parsing past by giving your program a memory. A well-chosen data structure can hold on to recent events. However, if you need to look behind a lot, or look ahead even a little, you probably need to switch to a different strategy: tree processing, the topic of Chapter 6.
Now you have the grounding for XML stream processing. Let’s move on to specific examples and see how to wrangle with XML streams in real life.
In the Perl universe, standard APIs have been slow to catch on for many reasons. CPAN, the vast storehouse of publicly offered modules, grows organically, with no central authority to approve of a submission. Also, with XML, a relative newcomer on the data format scene, the Perl community has only begun to work out standard solutions.
We can characterize the first era of XML hacking in Perl as the age of nonstandard parsers. It’s a time when documentation is scarce and modules are experimental. There is much creativity and innovation, and just as much idiosyncrasy and quirkiness. Surprisingly, many of the tools that first appeared on the horizon were quite useful. It’s fascinating territory for historians and developers alike.
XML::PYX is one of these early parsers. Streams naturally lend themselves to the concept of pipelines, where data output from one program can be plugged into another, creating a chain of processors. There’s no reason why XML can’t be handled that way, so an innovative and elegant processing style has evolved around this concept. Essentially, the XML is repackaged as a stream of easily recognizable and transmutable symbols, even as a command-line utility.
One example of this repackaging is PYX, a symbolic encoding of XML markup that is friendly to text processing languages like Perl. It presents each XML event on a separate line very cleverly. Many Unix programs like awk and grep are line oriented, so they work well with PYX. Lines are happy in Perl too.
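Table 4-1 summarizes the notation of PYX.
Table 4-1. PYX notation
Symbol  Meaning
(       An element start tag
)       An element end tag
-       Character data
A       An attribute
?       A processing instruction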
For every event coming through the stream, PYX starts a new line, beginning with one of the five symbols shown in Table 4-1. The symbol is followed by the element name or whatever other data is pertinent. Special characters are escaped with a backslash, as you would see in Perl code.
Here’s what the conversion from XML to PYX notation looks like. Suppose the parser receives the following XML as input:
<shoppinglist>
  <!-- brand is not important -->
  <item>toothpaste</item>
  <item>rocket engine</item>
  <item optional="yes">caviar</item>
</shoppinglist>
As PYX, it would look like this:
(shoppinglist
-\n
(item
-toothpaste
)item
-\n
(item
-rocket engine
)item
-\n
(item
Aoptional yes
-caviar
)item
-\n
)shoppinglist
Notice that the comment didn’t come through in the PYX translation. PYX is a little simplistic in some ways, omitting some details in the markup. It will not alert you to CDATA markup sections, although it will let the content pass through. Perhaps the most serious loss is character entity references that disappear from the stream. You should make sure you don’t need that information before working with PYX.
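Because every PYX line starts with a one-character symbol, turning a line back into an event takes very little code. The following sketch shows one way to do it in plain Perl. The event names in `%event_for` are our own labels, the `decode_pyx_line` routine is hypothetical, and we assume the backslash escapes in character data are `\n`, `\t`, and `\\`; check the output of your PYX generator before relying on that.

```perl
use strict;
use warnings;

# Map each leading PYX symbol to an event name (the names are our own).
my %event_for = (
    '(' => 'start',
    ')' => 'end',
    'A' => 'attribute',
    '-' => 'text',
    '?' => 'pi',
);

# Turn one line of PYX into a [ type, data ] pair.
sub decode_pyx_line {
    my $line = shift;
    my( $symbol, $data ) = $line =~ /^(.)(.*)$/s
        or die "empty PYX line\n";
    my $type = $event_for{ $symbol }
        or die "unknown PYX symbol '$symbol'\n";
    if( $type eq 'text' ) {
        # undo the assumed backslash escapes in a single pass
        $data =~ s/\\(.)/$1 eq 'n' ? "\n" : $1 eq 't' ? "\t" : $1/ge;
    }
    return [ $type, $data ];
}

my @pyx = ( '(item', 'Aoptional yes', '-caviar', ')item', '-\n' );
foreach my $line ( @pyx ) {
    my( $type, $data ) = @{ decode_pyx_line( $line ) };
    $data =~ s/\n/<newline>/g;          # make newlines visible when printing
    print "$type: $data\n";
}
```

An attribute line still holds the name and value together, separated by a space; splitting them is left out to keep the sketch short.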
Matt Sergeant has written a module, XML::PYX, which parses XML and translates it into PYX. The compact program in Example 4-2 strips out all XML element tags, leaving only the character data.
Example 4-2. PYX parser
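use strict;
use warnings;
use XML::PYX;

# initialize parser and generate PYX
my $parser = XML::PYX::Parser->new;
my $file = shift @ARGV or die "usage: $0 file.xml\n";
my $pyx = $parser->parsefile( $file );

# filter out the tags: PYX puts one event on each line, and only
# lines beginning with "-" carry character data
foreach ( split( /\n/, $pyx )) {
  print $1 if /^-(.*)/;
}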
PYX is an interesting alternative to SAX and DOM for quick-and-dirty XML processing. It’s useful for simple tasks like element counting, separating content from markup, and reporting simple events. However, it does lack sophistication, making it less attractive for complex processing.
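Element counting, for instance, comes down to tallying the lines that begin with an open parenthesis. Here is a small sketch that works on a PYX string directly, so it runs without any XML module installed. The `count_elements` routine is a hypothetical helper, and the sample data is the shopping list PYX from earlier in this section.

```perl
use strict;
use warnings;

# Count element start tags in a PYX string; returns a hash reference
# mapping each element name to the number of times it was opened.
sub count_elements {
    my $pyx = shift;
    my %count;
    foreach my $line ( split /\n/, $pyx ) {
        $count{ $1 }++ if $line =~ /^\((.+)/;    # "(" marks a start tag
    }
    return \%count;
}

# PYX for the shopping list, in a single-quoted heredoc so the
# backslashes stay literal.
my $pyx = <<'END_PYX';
(shoppinglist
-\n
(item
-toothpaste
)item
-\n
(item
-rocket engine
)item
-\n
(item
Aoptional yes
-caviar
)item
-\n
)shoppinglist
END_PYX

my $count = count_elements( $pyx );
printf "%s: %d\n", $_, $count->{ $_ } foreach sort keys %$count;
# prints:
# item: 3
# shoppinglist: 1
```

Character data that happens to contain a parenthesis is not miscounted, because those lines begin with a hyphen rather than the parenthesis itself.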
Another early parser is XML::Parser, the first fast and efficient parser to hit CPAN. We detailed its many-faceted interface in Chapter 3. Its built-in stream mode is worth a closer look, though. Let’s return to it now with a solid stream example.
We’ll use XML::Parser to read a list of records encoded as an XML document. The records contain contact information for people, including their names, street addresses, and phone numbers. As the parser reads the file, our handler will store the information in its own data structure for later processing. Finally, when the parser is done, the program sorts the records by the person’s name and outputs them as an HTML table.
The source document is listed in Example 4-3. It has a <list> element as the root, with four <entry> elements inside it, each with an address, a name, and a phone number.
Example 4-3. Address book file
<list>
  <entry>
    <name><first>Thadeus</first><last>Wrigley</last></name>
    <phone>716-505-9910</phone>
    <address>
      <street>105 Marsupial Court</street>
      <city>Fairport</city><state>NY</state><zip>14450</zip>
    </address>
  </entry>
  <entry>
    <name><first>Jill</first><last>Baxter</last></name>
    <address>
      <street>818 S. Rengstorff Avenue</street>
      <zip>94040</zip>
      <city>Mountainview</city><state>CA</state>
    </address>
    <phone>217-302-5455</phone>
  </entry>
  <entry>
    <name><last>Riccardo</last>
    <first>Preston</first></name>
    <address>
      <street>707 Foobah Drive</street>
      <city>Mudhut</city><state>OR</state><zip>32777</zip>
    </address>
    <phone>111-222-333</phone>
  </entry>
  <entry>
    <address>
      <street>10 Jiminy Lane</street>
      <city>Scrapheep</city><state>PA</state><zip>99001</zip>
    </address>
    <name><first>Benn</first><last>Salter</last></name>
    <phone>611-328-7578</phone>
  </entry>
</list>
This simple structure lends itself naturally to event processing. Each <entry> start tag signals the preparation of a new part of the data structure for storing data. An </entry> end tag indicates that all data for the record has been collected and can be saved. Similarly, start and end tags for <entry> subelements are cues that tell the handler when and where to save information. Each <entry> is self-contained, with no links to the outside, making it easy to process.
The program is listed in Example 4-4. At the top is code used to initialize the parser object with references to subroutines, each of which will serve as the handler for a single event. This style of event handling is called a callback because you write the subroutine first, and the parser then calls it back when it needs it to handle an event.
After the initialization, we declare some global variables to store information from XML elements for later processing. These variables give the handlers a memory, as mentioned earlier. Storing information for later retrieval is often called saving state because it helps the handlers preserve the state of the parsing up to the current point in the document.
After the code that reads in the data and applies the parser to it, the rest of the program defines the handler subroutines. We handle five events: the start and end of the document, the start and end of elements, and character data. Other events, such as comments, processing instructions, and document type declarations, will all be ignored.
Example 4-4. Code for the address program
# initialize the parser with references to handler routines
#
use strict;
use warnings;
use XML::Parser;
my $parser = XML::Parser->new( Handlers => {
    Init  => \&handle_doc_start,
    Final => \&handle_doc_end,
    Start => \&handle_elem_start,
    End   => \&handle_elem_end,
    Char  => \&handle_char_data,
});

#
# globals
#
my $record;       # points to a hash of element contents
my $context;      # name of current element
my %records;      # set of address entries

#
# read in the data and run the parser on it
#
my $file = shift @ARGV;
if( $file ) {
    $parser->parsefile( $file );
} else {
    my $input = "";
    while( <STDIN> ) { $input .= $_; }
    $parser->parse( $input );
}
exit;

###
### Handlers
###

#
# As processing starts, output the beginning of an HTML file.
#
sub handle_doc_start {
    print "<html><head><title>addresses</title></head>\n";
    print "<body><h1>addresses</h1>\n";
}

#
# save element name and attributes
#
sub handle_elem_start {
    my( $expat, $name, %atts ) = @_;
    $context = $name;
    $record = {} if( $name eq 'entry' );
}

# collect character data into the recent element's buffer
#
sub handle_char_data {
    my( $expat, $text ) = @_;

    # Perform some minimal entitizing of naughty characters
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;

    $record->{ $context } .= $text;
}

#
# if this is an <entry>, collect all the data into a record
#
sub handle_elem_end {
    my( $expat, $name ) = @_;
    return unless( $name eq 'entry' );
    my $fullname = $record->{'last'} . $record->{'first'};
    $records{ $fullname } = $record;
}

#
# Output the sorted records and the close of the file at the end of processing.
#
sub handle_doc_end {
    print "<table border='1'>\n";
    print "<tr><th>name</th><th>phone</th><th>address</th></tr>\n";
    foreach my $key ( sort( keys( %records ))) {
        print "<tr><td>" . $records{ $key }->{ 'first' } . ' ';
        print $records{ $key }->{ 'last' } . "</td><td>";
        print $records{ $key }->{ 'phone' } . "</td><td>";
        print $records{ $key }->{ 'street' } . ', ';
        print $records{ $key }->{ 'city' } . ', ';
        print $records{ $key }->{ 'state' } . ' ';
        print $records{ $key }->{ 'zip' } . "</td></tr>\n";
    }
    print "</table>\n</body></html>\n";
}
To understand how this program works, we need to study the handlers. All handlers called by XML::Parser receive a reference to the expat parser object as their first argument, a courtesy to developers in case they want to access its data (for example, to check the input file’s current line number). Other arguments may be passed, depending on the kind of event. For example, the start-element event handler gets the name of the element as the second argument, and then gets a list of attribute names and values.
Our handlers use global variables to store information. If you don’t like global variables (in larger programs, they can be a headache to debug), you can create an object that stores the information internally. You would then give the parser your object’s methods as handlers. We’ll stick with globals for now because they are easier to read in our example.
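For readers who prefer the object approach, here is one way it might look. The sketch is an assumption about how you could structure it, not code from the address program: the `AddressCollector` class and its method names are made up, and the handler calls are simulated by hand (with `undef` standing in for the expat object) so that the example runs without a parser. Each method is wrapped in a closure, which yields plain subroutine references of the kind a `Handlers` list expects.

```perl
package AddressCollector;
use strict;
use warnings;

sub new { return bless { record => undef, context => '', records => {} }, shift }

sub start {
    my( $self, $expat, $name ) = @_;
    $self->{context} = $name;
    $self->{record}  = {} if $name eq 'entry';
}

sub char {
    my( $self, $expat, $text ) = @_;
    # ignore text that arrives before the first <entry>
    $self->{record}{ $self->{context} } .= $text if $self->{record};
}

sub end {
    my( $self, $expat, $name ) = @_;
    return unless $name eq 'entry';
    my $r = $self->{record};
    $self->{records}{ $r->{last} . $r->{first} } = $r;
}

# Wrap each method in a closure so it can be handed over as a plain
# subroutine reference.
sub handlers {
    my $self = shift;
    return (
        Start => sub { $self->start( @_ ) },
        End   => sub { $self->end( @_ ) },
        Char  => sub { $self->char( @_ ) },
    );
}

package main;

my $collector = AddressCollector->new;
my %h = $collector->handlers;

# Simulate the calls a parser would make for one short entry.
$h{Start}->( undef, 'entry' );
$h{Start}->( undef, 'first' );  $h{Char}->( undef, 'Jill' );    $h{End}->( undef, 'first' );
$h{Start}->( undef, 'last' );   $h{Char}->( undef, 'Baxter' );  $h{End}->( undef, 'last' );
$h{End}->( undef, 'entry' );

print join( ', ', sort keys %{ $collector->{records} } ), "\n";
# prints: BaxterJill
```

All the state that the globals held now lives inside `$collector`, so two collectors can run side by side without stepping on each other.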
The first handler is handle_doc_start, called at the start of parsing. This handler is a convenient way to do some work before processing the document. In our case, it just outputs HTML code to begin the HTML page in which the sorted address entries will be formatted. This subroutine has no special arguments.
The next handler, handle_elem_start, is called whenever the parser encounters the start of a new element. After the obligatory expat reference, the routine gets two arguments: $name, which is the element name, and %atts, a hash of attribute names and values. (Note that using a hash will not preserve the order of attributes, so if order is important to you, you should use an @atts array instead.) For this simple example, we don’t use attributes, but we leave open the possibility of using them later.
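If attribute order does matter, the array version of the handler might look like the following sketch. The routine name `handle_elem_start_ordered` and the attribute values are invented for the example, and the handler is called by hand with `undef` in place of the expat object.

```perl
use strict;
use warnings;

# Start handler that keeps attributes in the order they arrive by taking
# them as a flat list of name, value, name, value, ...
sub handle_elem_start_ordered {
    my( $expat, $name, @atts ) = @_;
    my @pairs;
    while( my( $att, $value ) = splice( @atts, 0, 2 )) {
        push @pairs, "$att=\"$value\"";
    }
    return join( ' ', $name, @pairs );
}

# Simulated call; undef stands in for the expat object.
print handle_elem_start_ordered( undef, 'item', optional => 'yes', brand => 'any' ), "\n";
# prints: item optional="yes" brand="any"
```

The `splice` call peels off one name and value pair per pass, and the loop ends when the list is empty.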
This routine sets up processing of an element by saving the name of the element in a variable called $context. Saving the element’s name ensures that we will know what to do with character data events the parser will send later. When the element is an <entry>, the routine also points $record at a new, empty hash, which will contain the data for each of <entry>’s subelements in a convenient look-up table.
The handler handle_char_data takes care of nonmarkup data, basically all the character data in elements. This text is stored in the second argument, here called $text. The handler only needs to save the content in the buffer $record->{ $context }. Notice that we append the character data to the buffer, rather than assign it outright. XML::Parser has a funny quirk in which it calls the character handler after each line or newline-separated string of text.[1] Thus, if the content of an element includes a newline character, this will result in at least two separate calls to the handler. If you didn’t append the data, then the last call would overwrite the one before it.
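The difference between appending and assigning is easy to see with a simulated run. In this sketch, the two handler names and the way the text is split into three pieces are assumptions for illustration; a real parser decides for itself where to break the character data.

```perl
use strict;
use warnings;

my( %assigned, %appended );
my $context = 'street';

# Two versions of the character handler: one overwrites, one appends.
sub char_assign { my( $expat, $text ) = @_; $assigned{ $context }  = $text }
sub char_append { my( $expat, $text ) = @_; $appended{ $context } .= $text }

# Simulate a parser delivering one element's text in three pieces.
foreach my $chunk ( "105 Marsupial", "\n", "Court" ) {
    char_assign( undef, $chunk );
    char_append( undef, $chunk );
}

print "assigned: $assigned{street}\n";      # only the last piece survives
( my $joined = $appended{street} ) =~ s/\n/ /;
print "appended: $joined\n";
# prints:
# assigned: Court
# appended: 105 Marsupial Court
```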
Not surprisingly, handle_elem_end handles end-of-element events. The second argument is the element’s name, as with the start-element event handler. For most elements, there’s not much to do here, but for <entry>, we have a final housekeeping task. At this point, all the information for a record has been collected, so the record is complete. We only have to store it in a hash, indexed by the person’s full name so that we can easily sort the records later. The sorting can be done only after all the records are in, so we need to store the record for later processing. If we weren’t interested in sorting, we could just output the record as HTML.
Finally, the handle_doc_end handler completes our set, performing any final tasks that remain after reading the document. It so happens that we do have something to do. We need to print out the records, sorted alphabetically by contact name. The subroutine generates an HTML table to format the entries nicely.
This example, which involved a flat sequence of records, was pretty simple, but not all XML is like that. In some complex document formats, you have to consider the parent, grandparent, and even distant ancestors of the current element to decide what to do with an event. Remembering an element’s ancestry requires a more sophisticated state-saving structure, which we will show in a later example.
[1] This way of reading text is uniquely Perlish. XML purists might be confused about this handling of character data. XML doesn’t care about newlines, or any whitespace for that matter; it’s all just character data and is treated the same way.