Appendix B. Web crawling

This appendix provides an overview of web crawling components, briefly describes the implementation of the crawler provided with the book, and lists a few open-source crawlers written in Java.

An overview of crawler components

Web crawlers are used to discover, download, and store content from the Web. As we saw in chapter 2, a web crawler is just one part of a larger application, such as a search engine.

A typical web crawler has the following components:

  • A repository module to keep track of all URLs known to the crawler.

  • A document download module that retrieves documents from the Web using a provided set of URLs.

  • A document parsing module that's responsible for extracting the raw content out of a variety of document ...
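
To make the first component in the list above more concrete, here is a minimal sketch of a URL repository in Java. It is not the crawler provided with the book: the class name UrlRepository and its methods (add, next, knownCount) are hypothetical, and the sketch assumes an in-memory store and treats URLs as plain strings without any normalization. It only illustrates the idea of remembering every URL the crawler has seen while handing each one to the download module at most once.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration of a URL repository; not the book's implementation.
class UrlRepository {
    // Every URL the crawler has ever been told about.
    private final Set<String> known = new HashSet<>();
    // URLs that are known but have not yet been handed out for download.
    private final Queue<String> pending = new ArrayDeque<>();

    /** Registers a URL; returns true only the first time the URL is seen. */
    boolean add(String url) {
        if (url == null || url.isEmpty()) {
            return false;
        }
        if (!known.add(url)) {
            return false; // already known, so do not queue it again
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to download in insertion order, or null if none is pending. */
    String next() {
        return pending.poll();
    }

    /** Number of distinct URLs seen so far, whether downloaded or not. */
    int knownCount() {
        return known.size();
    }
}
```

A download module would call next() in a loop to obtain work, and a parsing module would call add() for each link it extracts. Because add() rejects URLs that are already known, the same page is not queued twice. A real crawler would likely also normalize URLs and persist this state, which the sketch leaves out.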
