Appendix B. Understanding MapReduce
In December 2004, Google published a paper, “MapReduce: Simplified Data Processing on Large Clusters,” by Jeffrey Dean and Sanjay Ghemawat, describing their solution to Google’s pressing need to simplify computing across large clusters. Dean and Ghemawat settled on a paradigm in which parts of a job are mapped (dispatched) to all nodes in a cluster. Each node produces a slice of the intermediate result set, and all those slices are then reduced (aggregated) into the final result.
The MapReduce paper (http://mng.bz/8s06) solves these three main problems:
- Parallelization— How to parallelize the computation
- Distribution— How to distribute the data
- Fault-tolerance— How to handle component failure
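To make the map-then-reduce flow concrete, here is a minimal single-process sketch of a word count, the canonical MapReduce example. This is an illustration of the paradigm only, not Google’s implementation: the function names (`map_phase`, `reduce_phase`, `mapreduce`) are invented for this sketch, and real systems run the map calls on many nodes in parallel rather than in a loop.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs for each word in one input slice.
    return [(word, 1) for word in document.split()]

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key.
    return key, sum(values)

def mapreduce(documents):
    # Shuffle: group the intermediate pairs by key before reducing.
    # In a cluster, each document would be mapped on a different node.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = mapreduce(["a rose is a rose", "a daisy is a daisy"])
# counts == {'a': 4, 'rose': 2, 'is': 2, 'daisy': 2}
```

Because each map call sees only its own slice of the input and each reduce call sees only the values for one key, both phases can run independently on different machines, which is what makes the three problems above (parallelization, distribution, and fault tolerance) tractable.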
The core of MapReduce ...