General-purpose crawlers take a centralized, snapshot view of what is essentially a completely distributed hypermedium in uncontrolled flux. They seek to collect and process the entire contents of the Web in a centralized location, where it can be indexed in advance to answer any possible query. Meanwhile, the Web, already two billion pages in size, keeps growing and changing, which makes centralized processing more difficult. An estimated 600 GB worth of pages changed per month in 1997 alone [120].

In the Web's early days, most of it could be collected by small- to medium-scale crawlers. From 1996 to 1999, coverage was a very stiff challenge: from an estimated coverage of 35% in 1997 [16], crawlers dropped ...
