Chapter 5. Join Patterns

Having all your data in one giant data set is a rarity. For example, presume you have user information stored in a SQL database because it is updated frequently. Meanwhile, web logs arrive in a constant stream and are dumped directly into HDFS. Also, daily analytics that make sense of these logs are stored someone where in HDFS and financial records are stored in an encrypted repository. The list goes on.

Data is all over the place, and while it’s very valuable on its own, we can discover interesting relationships when we start analyzing these sets together. This is where join patterns come into play. Joins can be used to enrich data with a smaller reference set or they can be used to filter out or select records that are in some type of special list. The use cases go on and on as well.

In SQL, joins are accomplished using simple commands, and the database engine handles all of the grunt work. Sadly for us, joins in MapReduce are not nearly this simple. MapReduce operates on a single key/value pair at a time, typically from the same input. We are now working with at least two data sets that are probably of different structures, so we need to know what data set a record came from in order to process it correctly. Typically, no filtering is done prior to the join operation, so some join operations will require every single byte of input to be sent to the reduce phase, which is very taxing on your network. For example, joining a terabyte of data onto another ...

Get MapReduce Design Patterns now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.