
Deploying the analytics

With Hadoop in place, we can now focus on the distributed processing frameworks that we will use for analysis.

Performing a batch analysis with the Pig infrastructure

The first of the distributed processing frameworks that we will examine is Pig, a framework for data analysis. It allows the user to articulate an analysis in a simple, high-level scripting language; Pig then compiles these scripts down to MapReduce jobs.
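
As a rough sketch of what such a script looks like, the following counts clicks per URL; the input schema (a tab-delimited user and URL per record) is hypothetical, chosen only for illustration. When the script runs, Pig compiles these statements into one or more MapReduce jobs:

-- Load the records; the (user, url) schema here is a hypothetical one
clicks = LOAD 'click_thru_data.txt' AS (user:chararray, url:chararray);

-- Group the records by URL
by_url = GROUP clicks BY url;

-- Count the records in each group
counts = FOREACH by_url GENERATE group AS url, COUNT(clicks) AS hits;

-- Print the results
DUMP counts;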

Although Pig can read data from a few different systems (for example, S3), we will use HDFS as our data storage mechanism in this example. Thus, the first step in our analysis is to copy the data into HDFS.

To do this, we issue the following Hadoop commands:

hadoop fs -mkdir /user/bone/temp
hadoop fs -copyFromLocal click_thru_data.txt ...
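
To verify that the copy succeeded, you can list the target directory; the command below assumes the file was copied into the directory created in the first step:

hadoop fs -ls /user/bone/temp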
