Chapter 4. Left Outer Join
This chapter shows you how to implement a left outer join in the MapReduce environment. I provide three distinct implementations in MapReduce/Hadoop and Spark:
- MapReduce/Hadoop solution using the classic map() and reduce() functions
- Spark solution without using the built-in JavaPairRDD.leftOuterJoin()
- Spark solution using the built-in JavaPairRDD.leftOuterJoin()
Left Outer Join Example
Consider a company such as Amazon, which has over 200 million users and can process hundreds of millions of transactions per day. To understand the concept of a left outer join, assume we have two types of data: users and transactions. The users data includes each user's location information (say, location_id), while the transactions data includes user identity information (say, user_id) but no direct information about a user's location. Given users and transactions defined as:
users(user_id, location_id)
transactions(transaction_id, product_id, user_id, quantity, amount)
our goal is to find the number of unique locations in which each product has been sold.
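To make the goal concrete, here is a minimal in-memory sketch in plain Java. It is not the MapReduce/Hadoop or Spark solution; it only illustrates the expected result on a handful of rows. The class name, the method name, and the sample rows are invented for illustration. It also assumes that a transaction whose user_id has no row in users keeps its product in the output but contributes no location to the count.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop or Spark.
public class UniqueLocationsPerProduct {

    // users rows:        {user_id, location_id}
    // transactions rows: {transaction_id, product_id, user_id, quantity, amount}
    static Map<String, Integer> countLocations(List<String[]> users,
                                               List<String[]> transactions) {
        // Index the users table by user_id for constant-time lookups.
        Map<String, String> locationByUser = new HashMap<>();
        for (String[] u : users) {
            locationByUser.put(u[0], u[1]);
        }

        // Keep every transaction and attach the user's location when one exists.
        Map<String, Set<String>> locationsByProduct = new TreeMap<>();
        for (String[] t : transactions) {
            String productId = t[1];
            String location = locationByUser.get(t[2]); // null if user is unknown
            Set<String> locations =
                locationsByProduct.computeIfAbsent(productId, k -> new HashSet<>());
            if (location != null) {
                // Assumption: unmatched users add no location to the count.
                locations.add(location);
            }
        }

        // Reduce each set of locations to its size.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : locationsByProduct.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample data.
        List<String[]> users = Arrays.asList(
            new String[] {"u1", "UT"},
            new String[] {"u2", "GA"},
            new String[] {"u3", "CA"},
            new String[] {"u4", "CA"});
        List<String[]> transactions = Arrays.asList(
            new String[] {"t1", "p3", "u1", "1", "300"},
            new String[] {"t2", "p1", "u2", "1", "100"},
            new String[] {"t3", "p1", "u1", "1", "100"},
            new String[] {"t4", "p2", "u2", "1", "10"},
            new String[] {"t5", "p4", "u4", "1", "9"},
            new String[] {"t6", "p4", "u3", "1", "9"},
            new String[] {"t7", "p5", "u9", "2", "50"}); // u9 is not in users

        System.out.println(countLocations(users, transactions));
        // prints {p1=2, p2=1, p3=1, p4=1, p5=0}
    }
}
```

Product p4 is bought by two different users who share the location CA, so it counts one location rather than two. Product p5 is bought only by a user missing from users, so under the assumption above it is reported with a count of 0 instead of being dropped.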
But what exactly is a left outer join? Let T1 (a left table) and T2 (a right table) be two relations defined as follows (where K is the join key, t1 denotes the attributes of T1, and t2 denotes the attributes of T2):
T1 = (K, t1)
T2 = (K, t2)
The result of a left outer join for relations T1 and T2 on the join key K contains all records of the left table (T1), even if the join condition does not find any matching record in the right table (T2). If the ...