Appendix D. Optimized MapReduce join frameworks

In this appendix we’ll look at the two join frameworks we used in chapter 4. The first is the repartition join framework, which reduces the memory footprint of the Hadoop join implementation in the org.apache.hadoop.contrib.utils.join package. The second is a replicated join framework, to which you’ll add logic that caches the smaller of the two datasets being joined.

D.1. An optimized repartition join framework

The Hadoop contrib join package requires that all the values for a key be loaded into memory. How can you implement a reduce-side join without that memory space overhead? In this optimization you’ll cache the dataset that’s smallest in ...
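The caching idea can be illustrated with a minimal, Hadoop-free sketch in Java. It is not the framework's actual code: the class RepartitionJoinSketch, the Tagged type, and the joinKey method are hypothetical names introduced only for this illustration. The sketch also assumes that, for each key, the records from the smaller dataset reach the reducer before the records from the larger one (for example, through a secondary sort), so that only the smaller side needs to be held in memory while the larger side is streamed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RepartitionJoinSketch {

    /** Hypothetical record type: a value tagged with the dataset it came from. */
    static final class Tagged {
        final boolean fromSmall;
        final String value;

        Tagged(boolean fromSmall, String value) {
            this.fromSmall = fromSmall;
            this.value = value;
        }
    }

    /**
     * Joins all values that share one key, the way a reducer would see them.
     * Assumption: values from the smaller dataset arrive first. Only those
     * values are cached; values from the larger dataset are streamed and
     * joined against the cache one at a time.
     */
    static List<String> joinKey(String key, Iterator<Tagged> values) {
        List<String> cachedSmall = new ArrayList<>();
        List<String> joined = new ArrayList<>();
        boolean seenLarge = false;

        while (values.hasNext()) {
            Tagged t = values.next();
            if (t.fromSmall) {
                if (seenLarge) {
                    // The ordering assumption was violated; fail loudly
                    // instead of silently dropping join results.
                    throw new IllegalStateException(
                        "small-side record arrived after a large-side record");
                }
                cachedSmall.add(t.value);
            } else {
                seenLarge = true;
                // Stream the large-side record: emit one row per cached value.
                for (String small : cachedSmall) {
                    joined.add(key + "\t" + small + "\t" + t.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Illustrative data only: one user record joined with two log records.
        List<Tagged> values = Arrays.asList(
            new Tagged(true, "alice"),
            new Tagged(false, "login"),
            new Tagged(false, "logout"));

        for (String row : joinKey("user1", values.iterator())) {
            System.out.println(row);
        }
    }
}
```

In this sketch the memory held per key is bounded by the number of small-side values rather than by all values for the key, which is the saving the section describes relative to loading every value into memory.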
