Intradomain web crawling using Apache Nutch

Web crawling is the process of visiting and downloading all or a subset of the pages on the Web. Although the concept sounds simple and a basic crawler is easy to implement, building a full-fledged crawler takes a great deal of work: it needs to be distributed, obey best practices such as honoring robots.txt and not overloading servers, perform periodic crawls, prioritize the pages to crawl, handle many document formats, and so on. Apache Nutch is an open source search engine that provides a highly scalable crawler with features such as politeness, robustness, and scalability.
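To make the politeness requirement concrete, the sketch below shows how a crawler can honor robots.txt rules using Python's standard `urllib.robotparser` module. The robots.txt content, crawler name, and URLs are hypothetical, for illustration only; Nutch performs this kind of check internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would first fetch
# http://example.com/robots.txt before requesting any page on the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) crawler may fetch each URL.
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/a"))   # False

# Crawl-delay tells a polite crawler how long to wait between requests.
print(rp.crawl_delay("MyCrawler"))  # 5
```

A polite crawler combines both pieces: it skips disallowed paths entirely and sleeps for the advertised crawl delay between successive requests to the same host.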

In this recipe, we are going to use Apache Nutch in the standalone ...
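As a rough preview of what a standalone (single-machine) Nutch crawl involves, the commands below sketch one cycle of the classic Nutch 1.x workflow. The directory names (`urls`, `crawl/crawldb`, `crawl/segments`) are conventional examples, not fixed requirements; paths and options may differ depending on your Nutch version and configuration.

```shell
# Seed the crawl: put one or more start URLs in a seed file.
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Inject the seed URLs into the crawl database.
bin/nutch inject crawl/crawldb urls

# Generate a fetch list (a new segment) from the crawl database.
bin/nutch generate crawl/crawldb crawl/segments

# Fetch and parse the newest segment, then fold the results
# back into the crawl database for the next cycle.
segment=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"
```

Repeating the generate/fetch/parse/updatedb cycle deepens the crawl; the recipe walks through these steps in detail.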
