In This Chapter
Thinking in parallel
Working with key/value pairs
Tracking your application flow
Running the sample MapReduce application
After you’ve stored reams and reams of data in HDFS (a distributed storage system spread over an expandable cluster of individual slave nodes), the first question that comes to mind is “How can I analyze or query my data?” Transferring all this data to a central node for processing isn’t the answer, since you’ll be waiting forever for the data to transfer over the network (not to mention waiting for everything to be processed serially). So what’s the solution? MapReduce!
As we describe in Chapter 1, Google faced this exact problem with their distributed Google File System (GFS), and came up with their MapReduce data processing model as the best possible solution. Google needed to be able to grow their data storage and processing capacity, and the only feasible model was a distributed system. In Chapter 4, we look at a number of the benefits of storing data in the Hadoop Distributed File System (HDFS): low cost, ...