How .to Summarize the Universe:
Dynamic Maintenance of Quantiles
Anna C. Gilbert
Yannis Kotidis S. Muthukrishnan
AT&T Labs Research
Florham Park, NJ
07032 USA
{ agilbert, kotidis, muthu, mstrauss} @research. att. corn
Martin J. Strauss
Order statistics, i.e., quantiles, are frequently
used in databases both at the database server
as well as the application level. For example,
they axe useful in selectivity estimation during
query optimization, in partitioning large rela-
tions, in estimating query result sizes when
building user interfaces, and in characterizing
the data distribution of evolving datasets in
the process of data mining.
We present a new algorithm for dynamically
computing quantiles of a relation subject to
insert as well as delete operations. The algo-
rithm monitors the operations and maintains
a simple, small-space representation (based on
random subset sums
or RSSs) of the underly-
ing data distribution. Using these RSSs, we
can quickly estimate, without having to ac-
cess the data, all the quantiles, each guaran-
teed to be accurate to within user-specified
precision. Previously-known one-pass quan-
tile estimation algorithms that provide sim-
ilar quality and performance guarantees can
not handle deletions. Other algorithms that
can handle delete operations cannot guaran-
tee performance without rescanning the entire
We present the algorithm, its theoretical per-
formance analysis and extensive experimental
results with synthetic and real datasets. In-
dependent of the rates of insertions and dele-
Permission to copy without fee all or part o] this material is
granted provided that the copies are not made or distributed ]or
direct commercial advantage, the VLDB copyright notice and
the title of the publication and its date appear, and notice is
given that copying is by permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conferences
Hong Kongs Chinas 2002
tions, our algorithm is remarkably precise at
estimating quantiles in small space, as our ex-
periments demonstrate.
1 Introduction
Most database management systems (DBMSs) main-
tain order statistics, i.e., quantiles, on the contents of
their database relations. Medians (half-way points)
and quartiles (quarter-way points) are elementary or-
der statistics. In the general case, the ¢-quantiles of
an ordered sequence of N data items are the values
with rank
2,... 1/¢.
Quantiles find multiple uses in databases. Sim-
ple statistics such as the mean and variance are both
insufficiently descriptive and highly sensitive to data
anomalies in real world data distributions. Quantiles
can summarize massive database relations more ro-
bustly. Many commercial DBMSs use
equi-depth his-
[21, 23], which are in fact quantiles, during
query optimization in order to estimate the size of in-
termediate results and pick competitive query execu-
tion plans. Quantiles can also be used for determining
association rules for data mining applications [1, 3, 2].
Quantile distribution helps design well-suited user in-
terfaces to visualize query result sizes. Also, quantiles
provide a quick similarity check in coarsely comparing
relations, which is useful in data cleaning [16]. Finally,
they are used as splitters in parallel database systems
that employ value range data partitioning [22] or for
fine-tuning external sorting algorithms [9].
Computing quantiles on demand in many of the
above applications is prohibitively expensive as it in-
volves scanning large relations. Therefore, quantiles
are precomputed within DBMSs. The central chal-
lenge then is to
them since database relations
evolve via transactions. Updates, inserts and deletes
change the data distribution of the values stored in
relations. As a result, quantiles have to be updated
to faithfully reflect the changes in the underlying data
distribution. Commercial database systems often hide

Get Proceedings 2002 VLDB Conference now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.