Writing MapReduce programs
In this chapter, we focus on batch workloads: given a set of historical data, we will examine properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.
Getting started
In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:
$ python stream.py -t -n 1000 > tweets.txt
We can then copy the dataset into HDFS with:
$ hdfs dfs -put tweets.txt <destination>
Tip
Note that until now we have been working only with the text of tweets. In the remainder of ...