Field Guide to Hadoop by Marshall Presser, Kevin Sitto

Chapter 5. Analytic Helpers

Now that you’ve ingested data into your Hadoop cluster, what’s next? Usually you’ll want to start by cleansing or transforming your data. This could be as simple as reformatting fields and removing corrupt records, or it could involve all manner of complex aggregation, enrichment, and summarization. Once you’ve cleaned up your data, you may be satisfied to push it into a more traditional data store, such as a relational database, and consider your big data work done. On the other hand, you may want to keep working with your data, running specialized machine-learning algorithms to categorize it or perhaps performing some sort of geospatial analysis.

In this chapter, we’re going to talk about two types of tools:

  • MapReduce interfaces: General-purpose tools that make it easier to process your data

  • Analytic libraries: Focused-purpose libraries that include functionality to make it easier to analyze your data

MapReduce Interfaces

In the early days of Hadoop, the only way to process the data in your system was to work with MapReduce in Java, but this approach presented a couple of major problems:

  • Your analytic writers need to understand not only your business and your data, but also Java code

  • Pushing a Java archive to Hadoop is more time-consuming than simply authoring a query

For example, the process of developing and testing a simple analytic written directly in MapReduce might look something ...
