
Apache Solr 3.1 Cookbook by Rafał Kuć


How to fetch and index web pages

There are many ways to index web pages. We could download them, parse them, and index them with Lucene and Solr. In most cases, the indexing part is not a problem. Fetching the pages is the harder question: we could write our own crawler, but that takes time and resources. That's why this recipe covers how to fetch and index web pages using Apache Nutch.
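To make the do-it-yourself route concrete, here is a minimal Python sketch of the download, parse, and index steps. It is not part of the recipe and not how Nutch works internally. The field names (`id`, `title`, `content`) and the update URL `http://localhost:8983/solr/update` are assumptions that would have to match your own Solr schema and deployment.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape


class PageTextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        if self._in_title:
            self.title_parts.append(data)
        elif data.strip():
            self.text_parts.append(data.strip())


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_page(url, html):
    """Turn raw HTML into a flat document; the field names are assumed."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,
        "title": "".join(parser.title_parts).strip(),
        "content": " ".join(parser.text_parts),
    }


def to_solr_add_xml(doc):
    """Build a Solr XML <add> message for a single document."""
    fields = "".join(
        '<field name="%s">%s</field>' % (name, escape(value))
        for name, value in doc.items()
    )
    return "<add><doc>%s</doc></add>" % fields


def post_to_solr(xml, update_url="http://localhost:8983/solr/update"):
    """POST the XML message to Solr; the URL is an assumed local default."""
    request = Request(
        update_url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urlopen(request) as response:
        return response.status
```

A single page could then be indexed with `post_to_solr(to_solr_add_xml(parse_page(url, fetch(url))))`, followed by a commit so that the document becomes searchable. Even this small sketch leaves out link discovery, politeness rules, retries, and character-set detection, which is the work a crawler such as Nutch takes off our hands.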

Getting ready

For this recipe, we will use version 1.2 of Apache Nutch. To download the binary package, go to the download section of http://nutch.apache.org.

How to do it...

First of all, we need to install Apache Nutch. To do that, we just need to extract the downloaded archive ...
