I/O-Conscious Data Preparation for
Large-Scale Web Search Engines
Maxim Lifantsev
Department of Computer Science
SUNY at Stony Brook
Stony Brook, NY 11794
Tzi-cker Chiueh
Department of Computer Science
SUNY at Stony Brook
Stony Brook, NY 11794
Abstract
Given that commercial search engines cover
billions of web pages, efficiently managing the
corresponding volumes of disk-resident data
needed to answer user queries quickly is a
formidable data manipulation challenge. We
present a general technique for efficiently car-
rying out large sets of simple transformation
or querying operations over external-memory
data tables. It greatly reduces the number
of performed disk accesses and seeks by max-
imizing the temporal locality of data access
and organizing most of the necessary disk ac-
cesses into long sequential reads or writes of
data that is reused many times while in mem-
ory. This technique is based on our expe-
rience from building a functionally complete
and fully operational web search engine. As
such, it is particularly well suited
for most data manipulation tasks in a modern
web search engine and is employed throughout
our prototype.
The key idea of this technique is co-
ordinated partitioning of related data tables
and corresponding partitioning and delayed
batched execution of the transformation and
querying operations that work with the data.
This data and processing partitioning is natu-
rally compatible with distributed data storage
and parallel execution on a cluster of worksta-
tions. Empirical measurements on the
prototype demonstrate that our technique can
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the VLDB copyright notice and
the title of the publication and its date appear, and notice is
given that copying is by permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference,
Hong Kong, China, 2002
improve the performance of external-memory
data preparation runs by a factor of 100 versus
a straightforward implementation.
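As a concrete illustration of the coordinated partitioning and delayed batched execution described in the abstract, the following minimal in-memory sketch (all names here are our own illustrative inventions, not code from the engine) buffers lookup operations per partition and answers them in one pass per partition, mirroring how long sequential reads of a partition replace per-operation disk seeks:

```python
NUM_PARTITIONS = 4

def partition_of(key):
    # Stable partitioning by key; a real system would hash page
    # identifiers or URLs. sum() of bytes keeps this deterministic.
    return sum(key.encode()) % NUM_PARTITIONS

class BatchedTable:
    """Simulates a disk-resident key->value table split into partitions.

    Queries are not answered immediately: they are buffered per
    partition and answered in one pass per partition at flush time,
    so each partition is "read" once instead of once per query.
    """
    def __init__(self, table):
        self.partitions = [{} for _ in range(NUM_PARTITIONS)]
        for k, v in table.items():
            self.partitions[partition_of(k)][k] = v
        self.pending = [[] for _ in range(NUM_PARTITIONS)]

    def lookup(self, key, callback):
        # Delay the query until its partition is processed.
        self.pending[partition_of(key)].append((key, callback))

    def flush(self):
        # One sequential pass per partition answers all buffered
        # queries against it while it is resident in memory.
        for part, ops in zip(self.partitions, self.pending):
            for key, callback in ops:
                callback(part.get(key))
        self.pending = [[] for _ in range(NUM_PARTITIONS)]

results = []
table = BatchedTable({"a": 1, "b": 2})
table.lookup("a", results.append)
table.lookup("c", results.append)   # absent key -> None
table.flush()
# "a" falls in partition 1 and "c" in partition 3, so the
# partition-order pass yields results == [1, None]
```

The callbacks stand in for the delayed continuation of each querying operation; because all operations touching one partition run while that partition is memory-resident, temporal locality is maximized.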
1 Introduction
Web search engines such as AltaVista [3], Fast
Search [2] and Google [11] are an indispensable tool
for web surfers to access the information on the global
Web. In the past two years, we have been working
on the implementation of a full-scale web search en-
gine that is based on a more general and powerful re-
source ranking model [16, 17] than Google's PageR-
ank [19]. During this effort, we found that the data
preparation process in a search engine poses a set of
requirements different from those of traditional
database applications, and thus deserves new
techniques to optimize its performance. This paper
reports the results of this investigation.
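For context, Google's PageRank [19], mentioned above, is itself computed by an iterative pass over the web linkage graph; the following minimal power-iteration sketch (illustrative only, and not the ranking model of [16, 17]) shows the kind of whole-graph computation involved:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Each page keeps a baseline share and receives a damped
        # fraction of the scores of the pages linking to it.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * scores[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its score uniformly
                for q in pages:
                    new[q] += damping * scores[p] / n
        scores = new
    return scores

links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
scores = pagerank(links)
# scores sum to 1; page "a", with the most in-links, ranks highest
```

At web scale the score and linkage tables are far too large for main memory, which is precisely the setting the rest of this paper addresses.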
To answer user queries efficiently, search engines in
particular need to prepare an index, usually in the
form of an inverted file: to a first approximation, it
associates each keyword with the list of
pages in which the keyword appears. To construct an
inverted index, a search engine needs to collect pages
from the Web, parse them, assign an identifier to each
page, and put the identifier of each page into the hit
lists for all keywords that it contains. In this process,
various types of data structures other than the inverted
index itself need to be created. In addition, more re-
cently developed search engines such as Google [11]
compute various scores for all web pages by perform-
ing a non-trivial iterative computation over the whole
web linkage graph. Because of the sheer data volume
involved, most of the data tables, including the in-
verted index and the data for page score computation,
cannot fit into the main memory of the processing
servers: Modern search engines presently index over
two billion web pages [11] and manipulate terabytes
of data. As a result, use of efficient external-memory
