Chapter 8. Searching and Indexing

In this chapter, we will cover the following recipes:

  • Generating an inverted index using Hadoop MapReduce
  • Intradomain web crawling using Apache Nutch
  • Indexing and searching web documents using Apache Solr
  • Configuring Apache HBase as the backend data store for Apache Nutch
  • Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
  • Elasticsearch for indexing and searching
  • Generating the in-links graph for crawled web pages

Introduction

MapReduce frameworks are well suited to large-scale search and indexing applications. In fact, Google developed the original MapReduce framework specifically to support the various operations involved in web search. The Apache Hadoop project was also started as a subproject ...
