Chapter 25. MapReduce

The future has already arrived. It’s just not evenly distributed yet.

William Gibson

MapReduce is a programming model for performing parallel processing on large datasets. Although it is a powerful technique, its basics are relatively simple.

Imagine we have a collection of items we’d like to process somehow. For instance, the items might be website logs, the texts of various books, image files, or anything else. A basic version of the MapReduce algorithm consists of the following steps:

  1. Use a mapper function to turn each item into zero or more key/value pairs. (Often this is called the map function, but there is already a Python function called map and we don’t need to confuse the two.)

  2. Collect together all the pairs with identical keys.

  3. Use a reducer function on each collection of grouped values to produce output values for the corresponding key.
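The three steps above can be sketched as a single generic function. This is an illustrative sketch, not the chapter's own implementation; the function and argument names (`map_reduce`, `mapper`, `reducer`) are my choices.

```python
from collections import defaultdict

def map_reduce(items, mapper, reducer):
    """Run the three MapReduce steps over items."""
    grouped = defaultdict(list)
    for item in items:                     # step 1: map each item
        for key, value in mapper(item):    #   to zero or more key/value pairs
            grouped[key].append(value)     # step 2: group values by key
    return [output                         # step 3: reduce each group
            for key, values in grouped.items()
            for output in reducer(key, values)]

# hypothetical usage: count character occurrences across some words
def char_mapper(word):
    for ch in word:
        yield (ch, 1)

def count_reducer(key, values):
    yield (key, sum(values))

counts = dict(map_reduce(["abc", "aab"], char_mapper, count_reducer))
# counts is {"a": 3, "b": 2, "c": 1}
```

Notice that the mapper and reducer are both generators: a mapper may emit zero, one, or many pairs per item, and a reducer may likewise emit any number of outputs per key.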

Note

MapReduce is sort of passé, so much so that I considered removing this chapter from the second edition. But I decided it’s still an interesting topic, so I ended up leaving it in (obviously).

This is all sort of abstract, so let’s look at a specific example. There are few absolute rules of data science, but one of them is that your first MapReduce example has to involve counting words.

Example: Word Count

DataSciencester has grown to millions of users! This is great for your job security, but it makes routine analyses slightly more difficult.

For example, your VP of Content wants to know what sorts ...
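Whatever the specific analysis, word counting maps onto the three-step recipe directly. Here is a minimal self-contained sketch; the names `wc_mapper`, `wc_reducer`, and `word_count` are illustrative assumptions, not necessarily the chapter's own definitions.

```python
from collections import defaultdict

def wc_mapper(document):
    # step 1: emit a (word, 1) pair for each word in the document
    for word in document.lower().split():
        yield (word, 1)

def wc_reducer(word, counts):
    # step 3: sum up the counts collected for this word
    yield (word, sum(counts))

def word_count(documents):
    """Count words across documents using the MapReduce steps."""
    grouped = defaultdict(list)                  # step 2: group by key
    for document in documents:
        for word, count in wc_mapper(document):
            grouped[word].append(count)
    return [result
            for word, counts in grouped.items()
            for result in wc_reducer(word, counts)]

results = dict(word_count(["data science", "big data"]))
# results is {"data": 2, "science": 1, "big": 1}
```

The payoff of writing it this way is that only step 2 (the grouping) needs to know anything about the distributed infrastructure; the mapper and reducer are plain functions that can run anywhere.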
