Appendix B. Understanding MapReduce

In December 2004, Google published a paper, “MapReduce: Simplified Data Processing on Large Clusters,” by Jeffrey Dean and Sanjay Ghemawat, summarizing the authors’ solution to Google’s urgent need to simplify cluster computing. Dean and Ghemawat settled on a paradigm in which parts of a job are mapped (dispatched) to all nodes in a cluster. Each node produces a slice of the intermediary result set. All those slices are then reduced (aggregated) back to the final result.
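The map-then-reduce flow described above can be illustrated with a small single-process sketch. This is not code from the paper: the word-count task, the function names (`map_chunk`, `merge_counts`, `word_count`), and the `num_nodes` parameter are all assumptions made for illustration, and the "nodes" are simulated by plain list slices rather than real machines.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: one simulated node counts the words in its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two intermediate slices into one."""
    return left + right


def word_count(lines, num_nodes=3):
    # Split the input into one chunk per simulated node (round-robin).
    chunks = [lines[i::num_nodes] for i in range(num_nodes)]
    # Each simulated node produces its slice of the intermediate result.
    slices = [map_chunk(chunk) for chunk in chunks]
    # Aggregate all slices back into the final result.
    return reduce(merge_counts, slices, Counter())


lines = ["the quick brown fox", "the lazy dog", "the fox jumps"]
result = word_count(lines)
print(result["the"], result["fox"], result["dog"])  # prints: 3 2 1
```

Because each chunk is processed independently and the merge is order-insensitive, the result is the same no matter how many simulated nodes the input is split across. A real cluster would run the map calls on separate machines; here they run one after another in a single process.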

The MapReduce paper solves these three main problems:

  • Parallelization— How to parallelize the computation
  • Distribution— How to distribute the data
  • Fault-tolerance— How to handle component failure

The core of MapReduce ...

