LegoDB: Customizing Relational Storage
for XML Documents
Philip Bohannon †
Juliana Freire †*
Prasan Roy †
† Lucent Bell Labs
600 Mountain Avenue
Murray Hill, NJ 07974, USA
{bohannon,juliana,prasan,simeon}@research.bell-labs.com
Jayant R. Haritsa ‡
Jérôme Siméon †
Maya Ramanath ‡
‡ Database Systems Lab, SERC
Indian Institute of Science
Bangalore 560012, INDIA
{haritsa,maya}@dsl.serc.iisc.ernet.in
1 Introduction
XML is becoming the predominant data exchange format in a variety of application domains (supply-chain, scientific data processing, telecommunication infrastructure, etc.). Not only is an increasing amount of XML data now being processed, but XML is also increasingly being used in business-critical applications. Efficient and reliable storage is an important requirement for these applications. By relying on relational engines for this purpose, XML developers can benefit from a complete set of data management services (including concurrency control, crash recovery, and scalability) and from the highly optimized relational query processors.
Because of the mismatch between the XML and the relational models and the many different ways to map an XML document into relations, it is very hard to tune a relational engine and ensure that XML queries will be evaluated efficiently. Most database vendors already offer solutions to address the need for reliable XML storage. However, current products (e.g., [10]) require developers to go through an often lengthy and complex process of manually defining a mapping from XML into relations.
Strategies that automate the process of generating XML-to-relational mappings have been proposed in the literature (see, e.g., [2, 4, 7, 8, 9]). Due to the flexibility of the XML infrastructure, different XML applications exhibit widely different characteristics (e.g., permissive vs. strict schemas, different access patterns). For example, a Web site may perform a large volume of simple lookup queries, whereas a catalog printing application may require large and complex queries with deeply nested results. As we show in [1], a fixed mapping or a mapping that does not take the application characteristics into account is unlikely to work well for more than a few of the wide variety of XML applications.
* Contact Author
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002
The purpose of this demonstration is to present the LegoDB system, which is aimed at providing XML developers with an efficient storage solution tuned for a given application.
2 Motivation
We motivate the need for finding appropriate storage mappings with an XML application scenario inspired by the Internet Movie Database (IMDB) [6]. This database, whose XML Schema is shown in Figure 1, contains a collection of shows, movie directors and actors. Each show can be either a movie or a TV show. Movies and TV shows share some elements (e.g., title and year of production), but there are also elements that are specific to each show type (e.g., only movies have a box_office, and only TV shows have seasons). Sample data that reflects real-world information and conforms to this schema is shown in Figure 2.
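The shape of a show described above can be sketched in a few lines of Python. This is only an illustration, not a copy of Figure 1 or Figure 2: the element names title, year, box_office and seasons come from the text, whereas the root element imdb, the element name show, the sample values, and the helper names classify and summarize are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Only title, year, box_office and seasons are
# named in the text; the root element, the <show> wrapper and all values
# are assumptions for illustration.
SAMPLE = """
<imdb>
  <show>
    <title>Example Movie</title>
    <year>1999</year>
    <box_office>1000000</box_office>
  </show>
  <show>
    <title>Example Series</title>
    <year>1994</year>
    <seasons>10</seasons>
  </show>
</imdb>
"""


def classify(show):
    """Return 'movie' or 'tv' depending on which type-specific element is present."""
    if show.find("box_office") is not None:
        return "movie"
    if show.find("seasons") is not None:
        return "tv"
    return "unknown"


def summarize(xml_text):
    """List (title, year, kind) for every show element in the document."""
    root = ET.fromstring(xml_text)
    return [
        (s.findtext("title"), int(s.findtext("year")), classify(s))
        for s in root.iter("show")
    ]


if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

The shared elements are read the same way for both kinds of show, while the type-specific elements act as the discriminator. This union-like structure is what gives a relational mapping a choice between keeping all shows in one table and splitting them by type.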
Three possible relational storage mappings for the IMDB schema are shown in Figure 3. Configuration (a) results from inlining as many elements as possible in a given table, roughly corresponding to the strategies presented in [8]. Configuration (b) is obtained from configuration (a) by partitioning the Reviews table into two tables: one that contains New York Times reviews, and another for reviews from other sources. Finally, configuration (c) is obtained from configuration (a) by splitting the Show table into Movie shows (Show_Part1) and TV shows (Show_Part2). Even though each of these configurations can be the best for a given application, there are cases where they perform poorly. The key point is that one cannot decide which of these configurations will perform