Tree Pattern Aggregation for Scalable XML Data Dissemination
Chee-Yong Chan, Wenfei Fan, Pascal Felber~ Minos Garofalakis, Rajeev Rastogi
Bell Labs, Lucent Technologies
{cychan, wenfei, minos, rastogi}@research.bell-labs, com, Pascal. Felber@eurecom. fr
Abstract
With the rapid growth of XML-document traffic on the
Internet, scalable content-based dissemination of XML
documents to a large, dynamic group of consumers has
become an important research challenge. To indicate
the type of content that they are interested in, data
consumers typically specify their subscriptions using
some XML pattern specification language (e.g., XPath).
Given the large volume of subscribers, system scalabil-
ity and efficiency mandate the ability to
aggregate
the
set of consumer subscriptions to a smaller set of con-
tent specifications, so as to both reduce their storage-
space requirements as well as speed up the document-
subscription matching process. In this paper, we pro-
vide the first systematic study of subscription aggre-
gation where subscriptions are specified with
tree pat-
terns
(an important subclass of XPath expressions). The
main challenge is to aggregate an input set of tree pat-
terns into a smaller set of generalized tree patterns such
that: (1) a given
space constraint
on the total size of the
subscriptions is met, and (2) the
loss in precision
(due
to aggregation) during document filtering is minimized.
We propose an efficient tree-pattern aggregation algo-
rithm that makes effective use of document-distribution
statistics in order to compute a
precise
set of aggregate
tree patterns within the allotted space budget. As part
of our solution, we also develop several novel algo-
rithms for tree-pattern containment and minimization,
as well as "least-upper-bound" computation for a set of
tree patterns. These results are of interest in their own
fight, and can prove useful in other domains, such as
XML query optimization. Extensive results from a pro-
totype implementation validate our approach.
1 Introduction
XML (eXtensible Markup Language) [16] has become
the dominant standard for data encoding and exchange
*Currently on leave from Temple University and supported in part by
NSF Career Award IIS-0093168.
t Current affiliation: lnstitut EURECOM, Sophia Antipolis, France
Permission to copy without fee all or part of this material is granted pro-
vided that the copies are not made or distributed for direct commercial
advantage, the VLDB copyright notice and the title of the publication and
its date appear, and notice is given that copying is by permission of the
Very Large Data Base Endowment. To copy otherwise, or to republish,
requires a fee and~or special permission from the Endowment.
Proceedings of the
28th VLDB
Conference,
l-long Kong, China, 2002
on the Internet, including e-Business transactions in both
Business-to-Business (B2B) and Business-to-Consumer
(B2C) applications. Given the rapid growth of XML traf-
fic on the Internet, the effective and efficient delivery of
XML documents has become an important issue. Con-
sequently, there is growing interest in the area of XML
content-based filtering and routing
(e.g., [4]), which ad-
dresses the problem of effectively directing high volumes
of XML-document traffic to interested consumers based
on document
contents.
Unlike conventional routing, where
packets are routed based on a limited, fixed set of attributes
(e.g., source/destination IP addresses and port numbers),
content-based routing is based on general patterns of the
document contents, which is significantly more flexible and
demanding. Consumers typically specify their
subscrip-
tions,
indicating the type of XML content that they are
interested in, using some XML pattern specification lan-
guage (e.g., XPath [15]). For each incoming XML docu-
ment, a
content-based router
matches the document con-
tents against the set of subscriptions to identify the (sub)set
of interested consumers, and then routes the document to
them. Thus, in content-based routing, the "destination" of
an XML document is generally unknown to the data pro-
ducer, and is computed
dynamically
based on the document
contents and the active set of subscriptions.
Effective support for scalable, content-based XML rout-
ing is crucial to enabling efficient and timely delivery of
relevant XML documents to a large, dynamic group of con-
sumers. Given the large volume of potential consumers,
system scalability and efficiency madate the ability to ju-
diciously
aggregate
the set of consumer subscriptions to a
smaller set of content specifications. The goal, of course,
is to both reduce the subscriptions' storage space require-
ments (e.g., so that the routing table fits in main memory),
as well as speed up the filtering of incoming XML traf-
fic. For instance, a core router in a B2B application may
choose to aggregate subscriptions based on geographical
location, affiliation, or domain-specific information (e.g.,
telecommunications). Subscription aggregation essentially
involves aggregating an initial set of subscriptions S into a
smaller set A such that any document that matches some
subscription in S also matches some subscription in A.
However, since there is typically a
"loss of precision"
as-
sociated with such aggregation, the documents matched by
the aggregated set A is, in general, a superset of those
matched by the original set S. As a result, a document
may be routed to consumers who have not subscribed to
it, thus resulting in an increase in the amount of unwanted
826

Get Proceedings 2002 VLDB Conference now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.