Chapter 8. Searching and Indexing

In this chapter, we will cover the following recipes:

  • Generating an inverted index using Hadoop MapReduce
  • Intradomain web crawling using Apache Nutch
  • Indexing and searching web documents using Apache Solr
  • Configuring Apache HBase as the backend data store for Apache Nutch
  • Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
  • Elasticsearch for indexing and searching
  • Generating the in-links graph for crawled web pages

Introduction

MapReduce frameworks are well suited to large-scale search and indexing applications. In fact, Google developed the original MapReduce framework specifically to support the various operations involved in web search. The Apache Hadoop project was also started as a subproject ...
