MapReduce is the programming model implemented in the earliest versions of Hadoop. It is a deliberately simple model, designed to process large datasets on a distributed cluster in parallel batches. The core of MapReduce is composed of two programmable functions (a mapper that performs filtering and a reducer that performs aggregation) and a shuffler that routes the key-value pairs from the mappers to the right reducers. Google published a paper on MapReduce in 2004, a few months after filing a patent application on it.
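The mapper/shuffle/reducer pipeline can be sketched in plain Python with a classic word count. This is a hypothetical, single-machine illustration of the model, not the Hadoop API: the mapper emits key-value pairs, the shuffle step groups values by key, and the reducer aggregates each group.

```python
from collections import defaultdict

def mapper(line):
    # Filtering/transformation step: emit a (word, 1) pair per word.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Route every emitted value to the group owned by its key,
    # mimicking the shuffler that moves pairs to the right reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Aggregation step: combine all values seen for one key.
    return key, sum(values)

lines = ["a simple MapReduce example", "a distributed example"]
mapped = (pair for line in lines for pair in mapper(line))
result = dict(reducer(key, values) for key, values in shuffle(mapped))
print(result)
# {'a': 2, 'simple': 1, 'mapreduce': 1, 'example': 2, 'distributed': 1}
```

In a real cluster, the mapper and reducer run in parallel on many nodes and the shuffle moves data over the network; the division of labor, however, is exactly the one shown here.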

Specifically, here are the steps of MapReduce for the Hadoop implementation:

  • Data chunker: Data is read from the filesystem and split into chunks. A chunk is a piece of the input ...
