51

Data Structures in Web Information Retrieval*

Monika Henzinger

University of Vienna

51.1Introduction

51.2Inverted Indices

Index CompressionIndex Granularity

51.3Fingerprints

51.4Finding Near-Duplicate Documents

51.5Conclusions

References

51.1Introduction

Current search engines process thousands of queries per second over a collection of billions of web pages with a sub-second average response time. There are two reasons for this astonishing performance: Massive parallelism and a simple yet efficient data structure, called inverted index.

In this chapter we will describe inverted indices. The parallelism deployed by search engines is quite straightforward: Given a collection of documents and a user query the goal of information retrieval ...

Get Handbook of Data Structures and Applications, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.