With Hadoop in place, we can now focus on the distributed processing frameworks that we will use for analysis.
The first of the distributed processing frameworks we will examine is Pig. Pig is a data analysis framework that lets the user express an analysis in a simple high-level language, Pig Latin; Pig then compiles these scripts down to MapReduce jobs.
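To give a feel for what such a script looks like, here is a minimal Pig Latin sketch that loads a click-through file and totals clicks per URL. The file layout and field names are assumptions for illustration, not the schema of the actual data set:

```pig
-- Load tab-delimited click-through records from HDFS.
-- The path and two-field schema below are hypothetical.
clicks = LOAD '/user/bone/temp/click_thru_data.txt'
         AS (url:chararray, clicks:int);

-- Group records by URL and sum the click counts per group.
by_url = GROUP clicks BY url;
totals = FOREACH by_url GENERATE group, SUM(clicks.clicks);

-- Write the per-URL totals to the console.
DUMP totals;
```

When this script runs, Pig translates the LOAD/GROUP/FOREACH pipeline into one or more MapReduce jobs and submits them to the cluster.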
Although Pig can read data from a few different systems (for example, S3), we will use HDFS as our data storage mechanism in this example. Thus, the first step in our analysis is to copy the data into HDFS.
To do this, we issue the following Hadoop commands:
hadoop fs -mkdir /user/bone/temp
hadoop fs -copyFromLocal click_thru_data.txt ...
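After the copy, it is worth confirming that the file actually landed in HDFS before running any Pig scripts against it. A quick check, assuming the `hadoop` binary is on the PATH and the cluster is reachable:

```shell
# List the directory created above; the copied file should
# appear here if it was used as the copy destination.
hadoop fs -ls /user/bone/temp
```

If the listing is empty or the command fails, the copy did not complete and any downstream Pig job would fail at its LOAD step.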