Appendix D. Optimized MapReduce join frameworks
In this appendix we’ll look at the two join frameworks we used in chapter 4. The first is a repartition join framework that reduces the memory footprint of the Hadoop join implementation in the org.apache.hadoop.contrib.utils.join package. The second is a replicated join framework, to which you’ll add some smarts that cache the smaller of the datasets being joined.
D.1. An optimized repartition join framework
The Hadoop contrib join package requires that all the values for a key be loaded into memory. How can you implement a reduce-side join without that memory overhead? In this optimization you’ll cache the dataset that’s smallest in ...
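To make the caching idea concrete, here is a minimal sketch in plain Java with no Hadoop dependencies. It is not the framework's code: the names StreamingJoinSketch, Tagged, and joinOneKey are hypothetical. It also assumes that the values for a key reach the reducer with every record from the smaller dataset ahead of any record from the larger one (for example, through a secondary sort on a dataset tag), and that the join is an inner join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not the framework's API.
final class StreamingJoinSketch {

    /** A reducer input value tagged with the dataset it came from. */
    static final class Tagged {
        final boolean fromSmaller; // true if the record belongs to the smaller dataset
        final String value;

        Tagged(boolean fromSmaller, String value) {
            this.fromSmaller = fromSmaller;
            this.value = value;
        }
    }

    /**
     * Joins the values of a single key.
     * Assumption: the iterator yields all smaller-dataset records before
     * any larger-dataset record.
     */
    static List<String> joinOneKey(Iterator<Tagged> values) {
        // Only the smaller side is cached in memory.
        List<String> cached = new ArrayList<>();
        // Collected here for demonstration; a real reducer would emit each
        // joined record to its output instead of holding it in a list.
        List<String> joined = new ArrayList<>();
        while (values.hasNext()) {
            Tagged t = values.next();
            if (t.fromSmaller) {
                cached.add(t.value);
            } else {
                // Larger-side records are streamed: each one is combined
                // with the cached records and then discarded.
                for (String small : cached) {
                    joined.add(small + "," + t.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Tagged> values = Arrays.asList(
            new Tagged(true, "user=anne"),
            new Tagged(false, "log=login"),
            new Tagged(false, "log=logout"));
        for (String row : joinOneKey(values.iterator())) {
            System.out.println(row);
        }
    }
}
```

In this sketch, memory use per key grows with the number of smaller-dataset records only. The larger dataset's records pass through one at a time, which is the property the contrib package lacks when it loads all values for a key.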