Data Algorithms

by Mahmoud Parsian
July 2015
Content level: Intermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.

Chapter 4. Left Outer Join

This chapter shows you how to implement a left outer join in the MapReduce environment. I provide three distinct implementations in MapReduce/Hadoop and Spark:

  • MapReduce/Hadoop solution using the classic map() and reduce() functions

  • Spark solution without using the built-in JavaPairRDD.leftOuterJoin()

  • Spark solution using the built-in JavaPairRDD.leftOuterJoin()

Left Outer Join Example

Consider a company such as Amazon, which has over 200 million users and can process hundreds of millions of transactions per day. To understand the concept of a left outer join, assume we have two types of data: users and transactions. The users data includes each user’s location (say, location_id), while the transactions data includes the user’s identity (say, user_id) but no direct information about the user’s location. Given users and transactions defined as:

users(user_id, location_id)
transactions(transaction_id, product_id, user_id, quantity, amount)

our goal is to find the number of unique locations in which each product has been sold.
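To make the goal concrete, here is a minimal in-memory sketch in plain Java; it is not the chapter's MapReduce or Spark solution. The class name, method name, and sample rows are hypothetical. It also assumes that transactions is the left side of the join, so a product bought only by users missing from the users data is still reported, with zero known locations.

```java
import java.util.*;

class UniqueLocationsPerProduct {

    // users: user_id -> location_id
    // transactions: each row is {transaction_id, product_id, user_id, quantity, amount}
    // Returns product_id -> number of distinct known locations where it was sold.
    static Map<String, Integer> countLocations(Map<String, String> users,
                                               List<String[]> transactions) {
        Map<String, Set<String>> locationsByProduct = new TreeMap<>();
        for (String[] t : transactions) {
            String productId = t[1];
            String userId = t[2];
            // Every transaction keeps its product in the result (left side of the join).
            Set<String> locations =
                locationsByProduct.computeIfAbsent(productId, k -> new HashSet<>());
            // Assumption: a user with no users record contributes no location.
            String location = users.get(userId);
            if (location != null) {
                locations.add(location);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : locationsByProduct.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, String> users = new HashMap<>();
        users.put("u1", "UT");
        users.put("u2", "GA");
        users.put("u3", "CA");

        List<String[]> transactions = Arrays.asList(
            new String[] {"t1", "p3", "u1", "1", "300"},
            new String[] {"t2", "p1", "u2", "1", "100"},
            new String[] {"t3", "p1", "u1", "1", "100"},
            new String[] {"t4", "p2", "u2", "1", "10"},
            new String[] {"t5", "p4", "u4", "1", "9"});  // u4 has no users record

        System.out.println(countLocations(users, transactions));
        // prints {p1=2, p2=1, p3=1, p4=0}
    }
}
```

At Amazon scale neither data set fits in one machine's memory, which is why the chapter turns to MapReduce/Hadoop and Spark; the sketch only pins down what the answer should look like.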

But what exactly is a left outer join? Let T1 (the left table) and T2 (the right table) be two relations defined as follows (where t1 denotes the attributes of T1 and t2 denotes the attributes of T2):

  • T1 = (K, t1)

  • T2 = (K, t2)

The result of a left outer join for relations T1 and T2 on the join key of K contains all records of the left table (T1), even if the join condition does not find any matching record in the right table (T2). If the ...
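The definition can be illustrated with a small plain-Java sketch. The class and method names are hypothetical, and the attributes t1 and t2 are simplified to single strings. Following the usual SQL convention, an unmatched left record is emitted once, with null standing in for the missing right-side attributes.

```java
import java.util.*;

class LeftOuterJoinSketch {

    // left:  rows of T1 as {K, t1}
    // right: T2 grouped by key, K -> list of t2 values
    // Returns joined rows as {K, t1, t2}; t2 is null when T2 has no match for K.
    static List<String[]> leftOuterJoin(List<String[]> left,
                                        Map<String, List<String>> right) {
        List<String[]> result = new ArrayList<>();
        for (String[] row : left) {
            String key = row[0];
            List<String> matches = right.get(key);
            if (matches == null || matches.isEmpty()) {
                // No match in T2: the left record is still kept.
                result.add(new String[] {key, row[1], null});
            } else {
                // One output row per matching right record.
                for (String t2 : matches) {
                    result.add(new String[] {key, row[1], t2});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<String[]> t1 = Arrays.asList(
            new String[] {"k1", "a"},
            new String[] {"k2", "b"},
            new String[] {"k3", "c"});
        Map<String, List<String>> t2 = new HashMap<>();
        t2.put("k1", Arrays.asList("x", "y"));
        t2.put("k3", Arrays.asList("z"));

        for (String[] row : leftOuterJoin(t1, t2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Here k2 has no counterpart in T2, so it appears once in the output with a null right side, while k1 appears twice because it matches two right records.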


Publisher Resources

ISBN: 9781491906170