The evolution of Hadoop
Around 2003, Doug Cutting and Mike Cafarella started work on Nutch, a highly extensible, feature-rich, open source crawler and indexer. The goal was to provide an off-the-shelf crawler to meet the demands of document discovery. Nutch can run in a distributed fashion on a handful of machines, and it is polite in that it respects the robots.txt file on websites. Its plugin architecture lets developers add custom components, for example, third-party plugins that read different media types from the Web.
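To make the idea of a polite crawler concrete, here is a minimal sketch in Python of checking a URL against robots.txt rules before fetching it. This is not Nutch's own code (Nutch is written in Java). It uses the standard library's urllib.robotparser module, and the robots.txt content, the crawler name ExampleCrawler, and the example.com URLs are all hypothetical values chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# "ExampleCrawler" is an assumed crawler name, not a real user agent.
print(allowed(parser, "ExampleCrawler", "https://example.com/index.html"))
print(allowed(parser, "ExampleCrawler", "https://example.com/private/report.html"))
```

With these assumed rules, the first URL is permitted and the second, under /private/, is not. Because the protocol is advisory, nothing enforces the result; a polite crawler simply chooses to skip URLs for which the check fails.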
The Robot Exclusion Standard, or robots.txt protocol, is an advisory protocol that suggests crawling behavior. It is a file placed at the root of a website that suggests ...