MapReduce is the programming model implemented in the earliest versions of Hadoop. It is a deliberately simple model, designed to process large datasets in parallel batches across a distributed cluster. The core of MapReduce consists of two programmable functions (a mapper that performs filtering, and a reducer that performs aggregation) plus a shuffler that routes each mapper's output to the appropriate reducer. Google published a paper on MapReduce in 2004 (https://ai.google/research/pubs/pub62), a few months after filing a patent application on it.
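As a toy illustration of the model (not the Hadoop implementation, which runs these stages across many machines), the mapper, shuffler, and reducer roles can be sketched in plain Python using word counting, the canonical MapReduce example:

```python
from collections import defaultdict

def mapper(document):
    # Map stage: emit (key, value) pairs; here, (word, 1) for each word.
    for word in document.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    # Shuffle stage: group values by key, so each reducer
    # receives every value emitted for one key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce stage: aggregate the grouped values for one key.
    return key, sum(values)

def map_reduce(documents):
    mapped = (pair for doc in documents for pair in mapper(doc))
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())

counts = map_reduce(["the cat", "the dog"])
# counts == {"the": 2, "cat": 1, "dog": 1}
```

In a real cluster the shuffle step is the expensive part: pairs are partitioned by key and sent over the network so that all values for a given key land on the same reducer.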
Specifically, here are the steps of MapReduce for the Hadoop implementation:
- Data chunker: Data is read from the filesystem and split into chunks. A chunk is a piece of the input ...