MapReduce Design Patterns

by Donald Miner, Adam Shook
December 2012
Content level: Intermediate to advanced
247 pages
6h 48m
English
O'Reilly Media, Inc.

Chapter 5. Join Patterns

Having all your data in one giant data set is a rarity. For example, suppose you have user information stored in a SQL database because it is updated frequently. Meanwhile, web logs arrive in a constant stream and are dumped directly into HDFS. Also, daily analytics that make sense of these logs are stored somewhere in HDFS, and financial records are stored in an encrypted repository. The list goes on.

Data is all over the place, and while each data set is valuable on its own, we can discover interesting relationships when we start analyzing these sets together. This is where join patterns come into play. Joins can be used to enrich data with a smaller reference set, or they can be used to filter out or select records that appear in some type of special list. The use cases go on and on as well.
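The two uses named above, enriching records from a smaller reference set and selecting records that appear in a special list, can be sketched in plain Python. This is only an illustration and does not use Hadoop; the data and the names `users`, `logs`, `enrich`, and `select` are hypothetical and do not come from the book.

```python
# Hypothetical data: a small reference set of users and a larger list of log records.
users = {
    "u1": {"name": "Ada"},
    "u2": {"name": "Grace"},
}
logs = [
    {"user_id": "u1", "url": "/home"},
    {"user_id": "u3", "url": "/about"},
    {"user_id": "u2", "url": "/cart"},
]


def enrich(records, reference, key):
    """Attach reference fields to each record whose key appears in the reference set."""
    for record in records:
        match = reference.get(record[key])
        if match is not None:
            # Merge the log record with the matching reference record.
            yield {**record, **match}


def select(records, allowed, key):
    """Keep only the records whose key is in the given list of special values."""
    allowed = set(allowed)
    return [r for r in records if r[key] in allowed]


enriched = list(enrich(logs, users, "user_id"))
selected = select(logs, ["u3"], "user_id")
```

Here `enriched` holds the two log records that have a matching user, each with a `name` field added, and `selected` holds only the record for `u3`. The sketch assumes the reference set fits in memory, which is what makes a simple dictionary lookup possible.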

In SQL, joins are accomplished using simple commands, and the database engine handles all of the grunt work. Sadly for us, joins in MapReduce are not nearly this simple. MapReduce operates on a single key/value pair at a time, typically from the same input. We are now working with at least two data sets that probably have different structures, so we need to know which data set a record came from in order to process it correctly. Typically, no filtering is done prior to the join operation, so some join operations will require every single byte of input to be sent to the reduce phase, which is very taxing on your network. For example, joining a terabyte of data onto another ...
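To make the point about knowing where a record came from concrete, here is a minimal sketch of a join in which each mapper tags its output with the source data set, simulated in plain Python rather than on Hadoop. The function names, the `"U"` and `"L"` tags, and the record layout are assumptions made for illustration; `shuffle` merely stands in for the framework's grouping of values by key.

```python
from collections import defaultdict


def map_users(record):
    # Tag the value with "U" so the reducer knows it came from the user data set.
    yield record["user_id"], ("U", record)


def map_logs(record):
    # Tag the value with "L" so the reducer knows it came from the log data set.
    yield record["user_id"], ("L", record)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_values):
    """Inner join: pair every user record with every log record for this key."""
    user_records = [v for tag, v in tagged_values if tag == "U"]
    log_records = [v for tag, v in tagged_values if tag == "L"]
    for user in user_records:
        for log in log_records:
            yield {**user, **log}


def run_join(users, logs):
    """Run both mappers, group by key, then join each group in the reducer."""
    pairs = []
    for record in users:
        pairs.extend(map_users(record))
    for record in logs:
        pairs.extend(map_logs(record))
    joined = []
    for key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, values))
    return joined
```

Note that every input record passes through `shuffle`, including records that never find a match. In this simulation that costs nothing, but on a cluster it corresponds to the network traffic described above.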


ISBN: 9781449341954