Joining Ranked Inputs in Practice*
Ihab F. Ilyas
Walid G. Aref
Department of Computer Sciences, Purdue University
1398 Computer Science Building
West Lafayette IN 47907-1398
USA
{ilyas,aref}@cs.purdue.edu
Ahmed K. Elmagarmid
Hewlett Packard
1501 Page Mill Rd.
Palo Alto, CA 94304
USA
ahmed_elmagarmid@hp.com
Abstract
Joining ranked inputs is an essential require-
ment for many database applications, such as
ranking search results from multiple search
engines and answering multi-feature queries
for multimedia retrieval systems. We intro-
duce a new practical pipelined query operator,
termed NRA-RJ, that produces a global rank
from input ranked streams based on a score
function. The output of NRA-RJ can serve as
a valid input to other NRA-RJ operators in
the query pipeline. Hence, the NRA-RJ oper-
ator can support a hierarchy of join operations
and can be easily integrated in query process-
ing engines of commercial database systems.
The NRA-RJ operator bridges Fagin's opti-
mal aggregation algorithm into a practical im-
plementation and contains several optimiza-
tions that address performance issues. We
compare the performance of NRA-RJ against
recent rank join algorithms. Experimental re-
sults demonstrate the performance trade-offs
among these algorithms. The experimental
results are based on an empirical study ap-
plied to a medical video application on top
of a prototype database system. The study
reveals important design options and shows
that the NRA-RJ operator outperforms other
pipelined rank join operators when the join
condition is an equi-join on key attributes.
The support of the National Science Foundation un-
der Grants IIS-0093116, EIA-9972883, IIS-9974255, and F.IA-
9983249 is gratefully acknowledged.
Permission to copy without fee all
or
part o] this material is
granted provided that the copies are not made or distributed ]or
direct commercial advantage, the VLDB copyright notice and
the title of the publication and its date appear, and notice is
given that copying is by permission o] the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission ]rom the Endowment.
Proceedings of the 28th VLDB Conference,
ttong Kong, China, 2002
1 Introduction
A growing number of database applications require the
processing of ranking queries based on multiple at-
tributes. In the context of multimedia retrieval, pred-
icates often involve image similarity matching with~
.
respect to several features. Users may present an
example image, and query the database for images
"most similar" to the example based On color and tex-
ture. Although each database image object can eas-
ily be ranked for color and texture separately, results
must be presented to the user in a combined simi-
larity order. Another example from information re-
trieval is the search for documents containing search
topics from multiple sources [19]. While each source
provides retrieved documents sorted by relevance, the
collection of retrieved documents returned to the user
must be sorted in a combined relevance order. Other
examples include information retrieval based on mul-
tiple keywords, CAD similarity searches, geometrical
databases, medical imaging and molecular biology.
The query evaluation model used for a similarity
search does not generally return a collection of exact
matches, but rather a ranked collection of results with
a score attached to each result. Users are usually in-
terested in only the top-ranked results. The aggregate
score for a given result is obtained by combining the
scores of several
atomic
similarity rankings, where the
atomic ranking is based on a single feature or attribute
of the database object. For example, atomic rank-
ings are produced in single-feature similarity queries in
multimedia retrieval [5, 8, 11] and in document ranking
based on a single search topic in information retrieval
applications. The challenge for similarity search is the
ranking of results based on an overall aggregate score,
using several atomic rankings as input. A naive ap-
proach is to calculate a global score for each object
in the database and sort the objects according to the
computed score. This approach is not scalable, how-
ever, and the performance of the system deteriorates
rapidly as the size of the database increases.
Many algorithms have been proposed in the litera-
950

Get Proceedings 2002 VLDB Conference now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.