Chapter 6

MapReduce Programming

In This Chapter

Thinking in parallel

Working with key/value pairs

Tracking your application flow

Running the sample MapReduce application

After you’ve stored reams and reams of data in HDFS (a distributed storage system spread over an expandable cluster of individual slave nodes), the first question that comes to mind is “How can I analyze or query my data?” Transferring all this data to a central node for processing isn’t the answer, because you’d wait forever for the data to cross the network (not to mention waiting for everything to be processed serially on that one node). So what’s the solution? MapReduce!
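To make the idea concrete, here’s a minimal sketch of the map, shuffle, and reduce steps as plain Python running on a single machine. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are our own illustrative assumptions, and each inner list merely stands in for a block of data that would live on a separate node.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) key/value pair for each word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values that share a key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse the list of values for one key into a single total."""
    return key, sum(values)

def word_count(splits):
    """Run map over every split, then shuffle, then reduce each key."""
    mapped = []
    for split in splits:          # hypothetical: one split per node
        for line in split:
            mapped.extend(map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input: two splits, as if stored on two different nodes.
splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
counts = word_count(splits)
print(counts)
```

In this sketch the map step needs only the lines of its own split, which is what would let a real cluster run it where the data already lives rather than shipping everything to a central node.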

As we describe in Chapter 1, Google faced this exact problem with its distributed Google File System (GFS), and came up with the MapReduce data processing model as the best possible solution. Google needed to be able to grow its data storage and processing capacity, and the only feasible model was a distributed system. In Chapter 4, we look at a number of the benefits of storing data in the Hadoop Distributed File System (HDFS): low cost, ...


Hadoop For Dummies by
