MapReduce Design Patterns

by Donald Miner, Adam Shook
December 2012
Content level: Intermediate to advanced
247 pages
6h 48m
English
O'Reilly Media, Inc.
Content preview from MapReduce Design Patterns

Chapter 3. Filtering Patterns

The patterns in this chapter all have one thing in common: they don’t change the actual records. These patterns all find a subset of data, whether it be small, like a top-ten listing, or large, like the results of a deduplication. This differentiates filtering patterns from those in the previous chapter, which was all about summarizing and grouping data by similar fields to get a top-level view of the data. Filtering is more about understanding a smaller piece of your data, such as all records generated from a particular user, or the top ten most used verbs in a corpus of text. In short, filtering allows you to apply a microscope to your data. It can also be considered a form of search. If you are interested in finding all records that involve a particular piece of distinguishing information, you can filter out records that do not match the search criteria.
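As a rough illustration of the idea above, a filter can be sketched as a map-only step that emits each matching record unchanged and drops the rest. This is plain Python rather than Hadoop code; the `filter_map` function, the `keep` predicate, and the record layout are hypothetical names chosen for this sketch.

```python
from typing import Callable, Iterable, Iterator


def filter_map(records: Iterable[dict],
               keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Map-only filter: emit a record unchanged if it matches, else drop it.

    Hypothetical helper for illustration; not part of any MapReduce API.
    """
    for record in records:
        if keep(record):
            # The record itself is never modified, only kept or discarded.
            yield record


# Assumed sample data: each record is a dict with "user" and "text" fields.
records = [
    {"user": "alice", "text": "first post"},
    {"user": "bob", "text": "hello"},
    {"user": "alice", "text": "second post"},
]

# Keep only the records generated by one particular user.
alice_only = list(filter_map(records, lambda r: r["user"] == "alice"))
```

Because each record is evaluated independently, a step like this needs no grouping or reduce phase in this simplified form.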

Sampling, one common application of filtering, is about pulling out a sample of the data, such as the highest values for a particular field or a few random records. Sampling can be used to get a smaller, yet representative, data set on which further analysis can be done without having to deal with the much larger data set. Many machine learning algorithms simply do not work efficiently over a large data set, so tools that build models need to be applied to a smaller subset.
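The two kinds of sample mentioned above, a few random records and the highest values for a field, might be sketched as follows. This is a minimal single-process sketch in plain Python, not the book's implementation; `random_sample`, `top_n`, the `fraction` parameter, and the `score` field are assumptions made for illustration.

```python
import heapq
import random
from typing import Callable, Iterable, Iterator, List, Optional


def random_sample(records: Iterable[dict], fraction: float,
                  seed: Optional[int] = None) -> Iterator[dict]:
    """Keep each record independently with probability `fraction`.

    Hypothetical helper; the output size is only approximately
    fraction * len(records), not exact.
    """
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


def top_n(records: Iterable[dict], n: int,
          key: Callable[[dict], float]) -> List[dict]:
    """Return the n records with the largest key, highest first."""
    return heapq.nlargest(n, records, key=key)


# Assumed sample data: 1000 records with an arbitrary "score" field.
scores = [{"id": i, "score": (i * 37) % 101} for i in range(1000)]

sample = list(random_sample(scores, fraction=0.1, seed=42))
top_ten = top_n(scores, 10, key=lambda r: r["score"])
```

The random sample decides on each record in isolation, so it never needs to see the whole data set at once, whereas the top-N selection has to compare records against each other.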

A subsample can also be useful for development purposes. Simply grabbing the first thousand records typically is not the best sample ...


Publisher Resources

ISBN: 9781449341954