Chapter 2. Summarization Patterns

Your data set is large, with more data coming into the system every day. This chapter focuses on design patterns that produce a top-level, summarized view of your data so you can glean insights not available from looking at a localized set of records alone. Summarization analytics are all about grouping similar data together and then performing an operation such as calculating a statistic, building an index, or simply counting.

Calculating some sort of aggregate over groups in your data set is a great way to easily extract value right away. For example, you might want to calculate the total amount of money your stores have made by state or the average amount of time someone spends logged into your website by demographic. Typically, with a new data set, you’ll start with these types of analyses to help you gauge what is interesting or unique in your data and what needs a closer look.
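As a rough illustration of the second example, the sketch below averages time logged in per demographic in plain Python, without any MapReduce machinery. The `sessions` records, the age-bracket labels, and the `average_by_group` helper are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical records: (demographic, minutes logged in).
# These values are invented purely for illustration.
sessions = [
    ("18-24", 30.0),
    ("18-24", 50.0),
    ("25-34", 20.0),
    ("25-34", 40.0),
    ("25-34", 60.0),
]


def average_by_group(records):
    """Return the mean value for each group key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        totals[group] += value  # running sum per group
        counts[group] += 1      # running count per group
    return {group: totals[group] / counts[group] for group in totals}


averages = average_by_group(sessions)
print(averages)
```

Keeping a running sum and count per group, rather than storing every value, is what lets this kind of aggregate scale to data that does not fit in memory.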

The patterns in this chapter are numerical summarizations, inverted index, and counting with counters. They are more straightforward applications of MapReduce than some of the other patterns in this book. This is because grouping data together by a key is the core function of the MapReduce paradigm: all values that share a key are grouped together and collected at the same reducer. If the mapper emits the fields you want to group on as its output key, the MapReduce framework handles the grouping for free.
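The division of labor described above can be sketched with a small in-memory simulation. This is not Hadoop code: `map_record`, `shuffle`, `reduce_group`, and `run_job` are hypothetical names, and the `shuffle` function only stands in for the grouping that a real framework would perform between the map and reduce phases. The sales records are invented for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Mapper: emit the field to group on as the key, the amount as the value."""
    yield record["state"], record["amount"]


def shuffle(pairs):
    """Stand-in for the framework: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reducer: sees one key with all of its values and sums them."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list."""
    pairs = (pair for record in records for pair in map_record(record))
    grouped = shuffle(pairs)
    return dict(reduce_group(key, values) for key, values in sorted(grouped.items()))


# Hypothetical sales records, made up for this example.
sales = [
    {"state": "CA", "amount": 100},
    {"state": "NY", "amount": 250},
    {"state": "CA", "amount": 50},
]

totals = run_job(sales)
print(totals)
```

Note that the mapper and reducer contain no grouping logic of their own. Choosing `state` as the emitted key is the only decision the job author makes about how records are grouped.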

Numerical Summarizations

Pattern Description

The numerical summarizations ...
