
Hadoop MapReduce v2 Cookbook - Second Edition by Thilina Gunarathne


Indexing and searching web documents using Apache Solr

Apache Solr is an open source search platform that is part of the Apache Lucene project. It supports powerful full-text search, faceted search, dynamic clustering, database integration, rich document (for example, Word and PDF) handling, and geospatial search. In this recipe, we index the web pages crawled by Apache Nutch into Apache Solr and then use Solr to search through those pages.
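To make the search half of this recipe more concrete, the following is a minimal sketch of how a query against Solr's HTTP select handler could be assembled once the crawled pages are indexed. It only builds the request URL and does not contact a server. The host and port, the core name collection1, and the field names title, url, and content are assumptions for illustration and are not taken from this recipe; adjust them to match your Solr installation and schema.

```python
from urllib.parse import urlencode


def build_solr_select_url(query, base_url="http://localhost:8983/solr",
                          core="collection1", fields=("title", "url"), rows=10):
    """Build a URL for Solr's /select search handler.

    Assumptions (not from the recipe): a local Solr instance on port 8983,
    a core named "collection1", and indexed documents that expose the
    "title" and "url" fields.
    """
    params = {
        "q": query,              # query string, for example "content:hadoop"
        "fl": ",".join(fields),  # comma-separated fields to return per hit
        "rows": rows,            # maximum number of hits to return
        "wt": "json",            # request a JSON-formatted response
    }
    # urlencode percent-escapes characters such as ":" and "," in the values
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query: pages whose "content" field mentions "hadoop"
    print(build_solr_select_url("content:hadoop"))
```

The printed URL can be opened in a browser or fetched with any HTTP client once Solr is running and the index has been populated.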

Getting ready

  1. Crawl a set of web pages using Apache Nutch by following the Intradomain web crawling using Apache Nutch recipe.
  2. Make sure JDK 1.7 or later is installed. Solr 4.8 and later versions require it.

How to do it...

The following steps show you how to index and search your crawled web pages dataset:

  1. Download and extract ...
