Apache Solr 3 Enterprise Search Server by Eric Pugh, David Smiley

Nutch for crawling web pages

A very common source of searchable data is content in web pages, either from the Internet or from behind the firewall. The long-time popular solution for crawling and searching web pages is Nutch, a former Lucene sub-project. Nutch focuses on Internet-scale web crawling, similar to Google, with components such as a web crawler, a link-graph database, and parsers for HTML and other common formats found on the Internet. It is designed to scale horizontally across multiple machines during crawling, using the big data platform Hadoop to manage the work.

Nutch has gone through varying levels of activity and community involvement, and recently reached version 1.3. Previously, Nutch used its own custom search ...
