Hadoop For Dummies by Dirk deRoos

Chapter 6

MapReduce Programming

In This Chapter

Thinking in parallel

Working with key/value pairs

Tracking your application flow

Running the sample MapReduce application

After you’ve stored reams and reams of data in HDFS (a distributed storage system spread over an expandable cluster of individual slave nodes), the first question that comes to mind is “How can I analyze or query my data?” Transferring all this data to a central node for processing isn’t the answer, since you’ll be waiting forever for the data to transfer over the network (not to mention waiting for everything to be processed serially). So what’s the solution? MapReduce!
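To make the idea concrete, here's a minimal sketch of the map, shuffle, and reduce steps as a word count in plain Python. It runs in a single process and doesn't use the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample `blocks` are hypothetical, chosen only for illustration. Each string in `blocks` stands in for a chunk of data that a real cluster would process on the node where it is stored.

```python
from collections import defaultdict

def map_phase(block):
    """Emit a (word, 1) key/value pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(mapped_outputs):
    """Group emitted values by key across the output of every mapper."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)

def word_count(blocks):
    # On a real cluster each block would be mapped in parallel on its own
    # node; here the "nodes" are simulated with an ordinary loop.
    mapped = [map_phase(block) for block in blocks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical sample data: three blocks, as if stored on three nodes.
blocks = ["big data big cluster", "data node", "big node"]
counts = word_count(blocks)
print(counts)
```

The point of the sketch is that `map_phase` needs only its own block, so the map work can happen wherever the data already lives. Only the much smaller key/value pairs have to travel over the network to be grouped and reduced.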

As we describe in Chapter 1, Google faced this exact problem with their distributed Google File System (GFS), and came up with their MapReduce data processing model as the best possible solution. Google needed to be able to grow their data storage and processing capacity, and the only feasible model was a distributed system. In Chapter 4, we look at a number of the benefits of storing data in the Hadoop Distributed File System (HDFS): low cost, ...
