The evolution of Hadoop

Around the year 2003, Doug Cutting and Mike Cafarella started work on a project called Nutch, a highly extensible, feature-rich, and open source crawler and indexer project. The goal was to provide an off-the-shelf crawler to meet the demands of document discovery. Nutch can work in a distributed fashion on a handful of machines and be polite by respecting the robots.txt file on websites. It is highly extensible by providing the plugin architecture for developers to add custom components, for example, third-party plugins, to read different media types from the Web.


Robot Exclusion Standard or the robots.txt protocol is an advisory protocol that suggests crawling behavior. It is a file placed on website roots that suggest ...

Get Hadoop: Data Processing and Modelling now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.