Chapter 5. Map-Only Operations

This chapter begins the Analytic Patterns section of the book. In this chapter and those that follow, we will walk you through a series of analytic patterns, with an example of each and a summary of when and where you might use it. As we go, you will add new tools to your analytic toolkit.

This chapter focuses exclusively on what we’ll call map-only operations. A map-only operation is one that can handle each record in isolation, like the translator chimps from Chimpanzee and Elephant Inc.’s first job. That property makes those operations trivially parallelizable: they require no reduce phase of their own.
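To make "each record in isolation" concrete, here is a minimal Python sketch (not Pig or Hadoop code). The function names and sample records are hypothetical, chosen only for illustration. It shows that when a transform depends on nothing but the record in hand, splitting the input into independently processed partitions gives the same result as processing it all in one place.

```python
from itertools import chain


def to_upper(record):
    """A per-record transform: the output depends only on this one record."""
    return record.upper()


def map_partition(records):
    """Apply the transform to one partition, as a single mapper would."""
    return [to_upper(r) for r in records]


records = ["ook", "eek", "ack", "oop"]

# Process the whole input in one place.
whole = map_partition(records)

# Split into two hypothetical partitions and process each independently,
# then concatenate the outputs in partition order.
partitions = [records[:2], records[2:]]
split = list(chain.from_iterable(map_partition(p) for p in partitions))

# Same answer either way, because no record needs to see any other record.
assert whole == split
```

Because the partitions never need to exchange data, nothing in this sketch plays the role of a shuffle or reduce step.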

Technically, these operations can run in either the map or the reduce phase of a MapReduce job. When a script uses only map-only operations, it produces a single mapper-only job that executes the composed pipeline stages.
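As a rough illustration of composed pipeline stages, the Python sketch below chains a filter stage and a per-record transform stage using generators. It is an analogy under stated assumptions, not how Pig or Hadoop is implemented: `filter_stage`, `foreach_stage`, and `counting_source` are hypothetical helpers, and the read counter exists only to show that the composed pipeline passes over the input once.

```python
def counting_source(records, counter):
    """Yield records one at a time, counting how many are read."""
    for r in records:
        counter["reads"] += 1
        yield r


def filter_stage(records, predicate):
    """Keep only the records for which predicate(record) is true."""
    for r in records:
        if predicate(r):
            yield r


def foreach_stage(records, transform):
    """Emit one transformed record per input record."""
    for r in records:
        yield transform(r)


counter = {"reads": 0}
source = counting_source([3, 8, 1, 10, 5], counter)

# Compose the stages; nothing runs until the pipeline is consumed.
pipeline = foreach_stage(
    filter_stage(source, lambda x: x >= 5),
    lambda x: x * 2,
)

# Each record flows through every stage before the next record is read,
# so the two stages share a single pass over the input.
result = list(pipeline)
```

Each record is read once and handed from stage to stage, so adding the second stage does not add a second pass over the data.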

We list all of these first, and together, for two reasons. One, they are largely fundamental; it’s hard to get much done without FILTER or FOREACH. Two, the way you reason about the performance impact of these operations is largely the same. Because these operations are trivially parallelizable, they scale efficiently and their computation cost rarely impedes throughput. And when pipelined, their performance cost can be summarized as “kids eat free with purchase of an adult meal.” For datasets of any material size, it’s very rare that the cost of preliminary or follow-on processing rivals ...
