Efficient Structural Joins on Indexed XML Documents
Shu-Yao Chien
NCR/Teradata Division
sc 180001
~ncr. corn
Zografoula Vagena
CSE Dept., UC Riverside
CSE Dept., U C Riverside
Vassilis J. Tsotras*
CSE Dept., UC Riverside
Queries on XML documents typically combine
selections on element contents, and, via path
expressions, the structural relationships be-
tween tagged elements. Structural joins are
used to find all pairs of elements satisfying the
primitive structural relationships specified in
the query, namely,
parent-child and ancestor-
relationships. Efficient support for
structural joins is thus the key to efficient
implementations of XML queries. Recently
proposed node numbering schemes enable the
capturing of the XML document structure us-
ing traditional indices (such as B+-trees or
R-trees). This paper proposes efficient struc-
tural join algorithms in the presence of tag
indices. We first concentrate on using B+-
trees and show how to expedite a structural
join by avoiding collections of elements that
do not participate in the join. We then intro-
duce an enhancement (based on
sibling point-
that further improves performance. Such
sibling pointers are easily implemented and
dynamically maintainable. We also present a
structural join algorithm that utilizes R-trees.
An extensive experimental comparison shows
that the B+-tree structural joins are more ro-
bust. Furthermore, they provide drastic im-
provement gains over the current state of the
* This research was supported by NSF grants IIS-9907477,
EIA-9983445, IIS-0070135,
and the
Dept. of Defense.
Permission to copy without fee all or part of this material is
granted provided
that the copies are not
made or distributed ,for
direct commercial
advantage, the
VLDB copyright notice
the title of the publication
and its date appear, and
notice is
given that copying is by permission of the Very Large
Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission from the Endowment.
Proceedings of the
28th VLDB Conference,
Hong Kong, China, 2002
1 Introduction
The problem of managing and querying XML doc-
uments efficiently poses interesting challenges for
database researchers. XML documents can have a
rather complex internal structure; in fact, an XML
document can be viewed as an ordered tree. Tree
nodes correspond to document elements (or attributes)
while edges represent direct element-subelement rela-
tionships. This tree-centric representation is appar-
ent in XML languages like Quilt [8], XQuery [39] and
XPath [38], which qualify documents for retrieval both
by their structure and the values in their elements.
For instance, a value-based selection for a document
can be specified by conditions on its elements' names
(tags), their attributes, and the text strings (i.e., PC-
DATA) contained in such elements. A structure-based
selection instead relies on the structural relationships,
namely: parent-child and ancestor-descendant rela-
tionships. For example, the query:
finds all figures with caption 'R-tree' under sections
whose title is "Overview". This complex query con-
sists of three conditions:
the conditions
figure[caption----"R-tree"] are
value-based since
they select document elements using their values
or contents, while
the double slash in section//figure corre-
sponds to a structural condition, in particular, an
ancestor-descendant relationship. It is a short-
hand for a
path expression
specifying that there
must be a path leading from the first element to
the second one (i.e., the second element must be
a descendant of the first in the document tree).
Using path expressions, users are allowed to navigate
through arbitrary long paths in the tree. A single slash
in the query (section/figure), would find only those
figures that are children of section elements (i.e., a
parent-child relationship).

Get Proceedings 2002 VLDB Conference now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.