Adaptive Index Structures
Yufei Tao
Department of Computer Science
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
taoyf@cs.ust.hk
Dimitris Papadias
Department of Computer Science
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
dimitris@cs.ust.hk
Abstract
Traditional indexes aim at minimizing the number of node accesses during query processing, which, however, does not necessarily minimize the total cost, due to the possibly large number of random accesses involved. In this paper, we propose a general framework for adaptive indexes that improve the overall query cost. The performance gain is achieved by allowing index nodes to contain a variable number of disk pages. Update algorithms dynamically re-structure adaptive indexes depending on the data and query characteristics. Extensive experiments show that adaptive B- and R-trees significantly outperform their conventional counterparts, while incurring minimal update overhead.
1. Introduction
A single disk access usually includes (i) a seek operation that positions the disk head at the requested sector (including cylinder seek and rotation), and (ii) data transfer to/from the main memory. Thus the total query cost is the sum of disk seek, data transfer, and CPU time. Although significant advances have been made to accelerate CPU processing and data transfer, there has been little progress on improving the seek time, due to the mechanical nature of disk head movement. The average seek time of the latest models from major hard disk vendors, for example, is around 10ms (almost the same as 10 years ago), while the CPU and data transfer costs are usually 1-2 orders of magnitude smaller. This performance difference is expected to become even greater in the future, which renders seek time the dominating factor in query cost.
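In other words (a restatement of the above observation, not a formula that appears explicitly at this point in the paper), the cost of a query that reads n_page pages grouped into n_seek contiguous runs is approximately

    cost(q) ≈ n_seek · T_SK + n_page · (T_TRF + T_CPU),

where T_SK, T_TRF and T_CPU denote the per-seek, per-page-transfer and per-page CPU costs used in the example below.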
Existing indexes focus on minimizing the number of
node accesses required for query processing. Since a
single node usually corresponds to a constant number
(typically, one) of disk pages, fewer node accesses lead to
a smaller number of page accesses. Minimizing the page
accesses, however, does not necessarily optimize the total
query cost. Consider, for example, that query q1 accesses 20 pages whose addresses are consecutive, while q2 must visit 10 random pages (i.e., non-consecutive ones). Then, the cost of q1 is approximately T_SK + 20·T_TRF + 20·T_CPU, where T_SK, T_TRF, T_CPU are the costs of performing one disk seek (around 10ms), transferring one page (1ms/page), and processing the records in one page (<0.1ms/page), respectively. Notice that only one disk seek is required to visit continuous pages (i.e., sequential accesses). Similarly, the cost of q2 is 10·T_SK + 10·T_TRF + 10·T_CPU (ten seeks must be performed to locate all pages, i.e., random accesses). Given that T_SK is usually significantly more expensive (over an order of magnitude) than T_TRF and T_CPU, processing q1 can be much cheaper than q2.
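To make the arithmetic concrete, the following Python sketch (ours, not part of the paper; the constants are the approximate values quoted above and the function name query_cost is purely illustrative) evaluates this cost model for q1 and q2:

    # Minimal sketch of the seek/transfer/CPU cost model, using the paper's
    # approximate constants: T_SK ~ 10ms per seek, T_TRF ~ 1ms per page,
    # T_CPU ~ 0.1ms per page.
    T_SK = 10.0    # cost of one disk seek (ms)
    T_TRF = 1.0    # cost of transferring one page (ms)
    T_CPU = 0.1    # cost of processing the records in one page (ms)

    def query_cost(num_pages, num_seeks):
        # One seek per contiguous run of pages, plus transfer and CPU per page.
        return num_seeks * T_SK + num_pages * (T_TRF + T_CPU)

    cost_q1 = query_cost(num_pages=20, num_seeks=1)    # 10 + 20 + 2  = 32 ms
    cost_q2 = query_cost(num_pages=10, num_seeks=10)   # 100 + 10 + 1 = 111 ms
    # q1 is cheaper than q2 even though it touches twice as many pages,
    # because it pays for one seek (sequential access) instead of ten (random).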
The design of traditional indexes usually overlooks the
difference between sequential and random accesses. Since
the pages allocated to sibling nodes are often not
consecutive, a query (such as q2) may incur a large
number of random accesses. The traditional method to
reduce random accesses in databases (and general file
systems) is to re-organize the data pages by de-
fragmentation. This approach, however, has several
serious drawbacks. First, re-organization involves some
expensive operations: (i) moving (i.e., reading and
writing) a large number of pages, and (ii) correcting
mutual references (e.g., pointers from parent index nodes
to their children, references to foreign keys, etc).
Particularly, since references are ubiquitous (especially
for databases with complex ER schemata), correcting
them usually involves updating a very large part of the
database. Second, a good page organization may soon
degenerate by subsequent updates, in which case the
benefit of re-organization vanishes in spite of its huge
cost.
One approach to remedy this is to allocate several
continuous pages to a node at a time. Thus, the query q2 mentioned earlier visits one leaf node (with ten pages) to
retrieve the same content, which reduces the random
access (i.e., seek) time. Setting the node size to a fixed
value, however, only favours queries with specific