MapReduce is the programming model implemented in the earliest versions of Hadoop. It is a deliberately simple model, designed to process large datasets on a distributed cluster in parallel batches. The core of MapReduce is composed of two programmable functions (a mapper that performs filtering and a reducer that performs aggregation) and a shuffler that routes the key-value pairs from the mappers to the right reducers. Google published a paper on MapReduce in 2004, a few months after filing a patent application on it.
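The mapper/shuffle/reducer pipeline can be sketched in plain Python with a classic word count. This is a hypothetical, single-machine illustration of the model, not the Hadoop API: the mapper emits key-value pairs, the shuffle step groups values by key, and the reducer aggregates each group.

```python
from collections import defaultdict

def mapper(line):
    # Filtering/transformation step: emit a (word, 1) pair per word.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Route every emitted value to the group owned by its key,
    # mimicking the shuffler that moves pairs to the right reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Aggregation step: combine all values seen for one key.
    return key, sum(values)

lines = ["a simple MapReduce example", "a distributed example"]
mapped = (pair for line in lines for pair in mapper(line))
result = dict(reducer(key, values) for key, values in shuffle(mapped))
print(result)
# {'a': 2, 'simple': 1, 'mapreduce': 1, 'example': 2, 'distributed': 1}
```

In a real cluster, the mapper and reducer run in parallel on many nodes and the shuffle moves data over the network; the division of labor, however, is exactly the one shown here.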

Specifically, here are the steps of MapReduce for the Hadoop implementation:

  • Data chunker: Data is read from the filesystem and split into chunks. A chunk is a piece of the input ...
