Data Structures in Web Information Retrieval*
Monika Henzinger
University of Vienna
Index Compression•Index Granularity
51.4Finding Near-Duplicate Documents
Current search engines process thousands of queries per second over a collection of billions of web pages with a sub-second average response time. There are two reasons for this astonishing performance: Massive parallelism and a simple yet efficient data structure, called inverted index.
In this chapter we will describe inverted indices. The parallelism deployed by search engines is quite straightforward: Given a collection of documents and a user query the goal of information retrieval ...
Get Handbook of Data Structures and Applications, 2nd Edition now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.