A Transducer-Based XML Query Processor*
Bertram Ludiischer t Pratik Mukhopadhyay t
Yannis Papakonstantinou $
t San Diego Supercomputer Center
University of California, San Diego
ludaesch¢sdsc, edu
SDepartment of Computer Science & Engineering
University of California, San Diego
{pmukhopa, yannis)@cs,
ucsd. edu
The XML Stream Machine (XSM) system is
a novel XQuery processing paradigm that is
tuned to the efficient processing of sequen-
tially accessed XML data (streams). The sys-
tem compiles a given XQuery into an XSM,
which is an XML stream transducer,
abstract device that takes as input one or
more XML data streams and produces one or
more output streams, potentially using inter-
nal buffers. We present a systematic way to
translate XQueries into efficient XSMs: First
the XQuery is translated into a network of
XSMs that correspond to the basic opera-
tors of the XQuery language and exchange
streams. The network is reduced to a single
XSM by repeated application of an XSM com-
position operation that is optimized to reduce
the number of tests and actions that the XSM
performs as well as the number of intermedi-
ate buffers that it uses. Finally, the optimized
XSM is compiled into a C program. First em-
pirical results illustrate the performance ben-
efits of the XSM-based processor.
XML is the standard for information exchange be-
tween applications and information sources. For ex-
ample, Web service implementations and mediators
exchange XML messages and data, and often in-
teract with information systems and databases via
* work partially supported by NSF/ITR OCE-0121726
ROADNET and NSF/NPACI ACI-9619020 NARA (Lud~scher),
NSF/DGov-9983510 I2T (Lud~cher, Mukhopadhyay, Papakon-
stantinou), NSF/IDM-9734548 CAREER (Papakonstantinou)
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed -for
direct commercial advantage, the VLDB copyright notice and
the title of the publication and its date appear, and notice is
given that copying is by permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference)
Hong Kong~ China~ 2002
XML import/export mechanisms. Web front-ends re-
ceive XML from the underlying information sources
and transform them into XHTML. Migration of XML
archives [MBR+00] to new, "refreshed" schemas re-
quires XML tranformations. Continuous data streams
from sensor networks [CFGR02] are transformed into
formats that can be readily consumed,
ing some procedures after detecting certain events
[ROA02]. The common denominator of the examples
above is that some sequentially accessed XML data
needs to be efficiently accessed and transformed into
some output XML data that fits the format required
by the consuming application.
A large number of important applications require
extremely efficient processing and transformation of
sequentially accessed streams. For example, there
are many ongoing works on efficient XSLT processors
IRes02, SAB, XAL, Kay], which convert XML files or
strings into XHTML. Other novel streaming applica-
efficient XML-based information filtering
do not require the expressive power of general XML
queries and transformation in the style of XQuery,
but are based on a limited subset,
XPath queries
[AF00, CFGR02].
The importance of efficient (XML) stream process-
ing will keep growing as communication and comput-
ing systems evolve. This is driven by two trends. The
first (and widely recognized trend) is the proliferation
of data stream sources whose number and bandwidth
increases as communication systems provide rapidly
increasing bandwidths and connectivity that allows re-
mote sources, such as sensors, to produce and emit
data streams. The second, long-term, driver of the
importance of efficient (XML) stream processing will
be the growth of stream bandwidths at rates that sur-
pass the growth of CPU-memory and CPU-disk access
and transfer rates. It is already inefficient for many
stream applications to buffer the data in disk-based
buffers and analyze it later. At some point it may
become impossible for high-bandwidth streams to be
pushed even to the RAM memory - the CPU will have
to quickly react to pieces of incoming data using just
its cache memory.
A qualitatively different architecture is needed

Get Proceedings 2002 VLDB Conference now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.