Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset by Michael Frampton

CHAPTER 3

Collecting Data with Nutch and Solr

Many companies collect vast amounts of data from the web by using web crawlers such as Apache Nutch. Available for more than ten years, Nutch is an open-source product provided by Apache and has a large community of committed users. Solr, an open-source search platform built on Apache Lucene, can be used with Nutch to index and search the data that Nutch collects. When you combine this functionality with Hadoop, you can store the resulting large volume of data directly in a distributed file system.

In this chapter, you will learn a number of methods to connect various releases of Nutch to Hadoop. ...
