CHAPTER 8 RESOURCE DISCOVERY

General-purpose crawlers take a centralized, snapshot view of what is essentially a completely distributed hypermedium in uncontrolled flux. They seek to collect and process the entire contents of the Web in a central location, where it can be indexed in advance so that any possible query can be answered. Meanwhile the Web, already at two billion pages, keeps growing and changing, making such centralized processing ever more difficult. An estimated 600 GB worth of pages changed per month in 1997 alone [120].

In its initial days, most of the Web could be collected by small- to medium-scale crawlers. Between 1996 and 1999, achieving broad coverage became a very stiff challenge: from an estimated coverage of 35% in 1997 [16], crawlers dropped ...
