Big Data for Chimps by Russell Jurney, Philip Kromer

Chapter 7. Joining Tables

In this chapter, we’ll cover JOIN operations in Pig. A join combines multiple datasets or relations into a single relation based on the presence of a common key or keys. Pig supports several types of JOIN operations, including inner joins and left, right, and full outer joins. We’ll learn how to perform different kinds of joins in Pig, and we’ll also walk through how a join works at a low level, in Python/MrJob. By the end of the chapter, you’ll understand how to join like a pro.

To understand this chapter, it helps to be familiar with joins from SQL or a related background. If you’re new to joins, a more thorough introduction will help: check out Jeff Atwood’s post “A Visual Explanation of SQL Joins”.

In database terminology, a join combines the rows of two or more tables based on some matching information, known as a key. For example, you could join a table of names and a table of mailing addresses, so long as both tables had a common field for the user ID. You could also join a table of prices to a table of items, given an item ID column in both tables. Joins are useful because they permit people to normalize data (that is to say, eliminate redundant content between multiple tables) yet still bring several tables’ content to a single view on the fly.
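To make the idea concrete, here is a minimal sketch of the names-and-addresses example in plain Python. The sample rows and the `inner_join` helper are made up for illustration; they are not taken from the book's code or from Pig.

```python
# Hypothetical sample data: (user_id, name) and (user_id, address) rows.
names = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
addresses = [(1, "12 Oak St"), (3, "9 Elm Ave"), (4, "77 Pine Rd")]

def inner_join(left, right):
    """Join two lists of (key, value) rows on their first field."""
    # Index the right-hand table by key so each lookup is a dict access.
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    # Emit one output row per matching (left, right) pair;
    # keys present on only one side produce no output.
    return [(key, lval, rval)
            for key, lval in left
            for rval in index.get(key, [])]

joined = inner_join(names, addresses)
print(joined)  # [(1, 'Alice', '12 Oak St'), (3, 'Carol', '9 Elm Ave')]
```

User 2 has no address and address 4 has no user, so neither appears in the result; that is what distinguishes an inner join from the outer variants.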

Joins are pedestrian fare in relational databases. Far less so for Hadoop, as MapReduce wasn’t really created with joins in mind, and you have to go through acrobatics to make it work.[1] Pig’s JOIN operator ...
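One common form these acrobatics take is a reduce-side join: the mapper tags each record with the table it came from, the shuffle groups records by key, and the reducer pairs up records from the two sides. The sketch below simulates those three phases in plain Python, without Hadoop or MrJob. All function names and sample rows are hypothetical, and this is an illustration of the general technique rather than the book's implementation.

```python
from collections import defaultdict

def map_phase(table_tag, rows):
    """Emit (key, (tag, value)) so the reducer can tell the tables apart."""
    for key, value in rows:
        yield key, (table_tag, value)

def shuffle(pairs):
    """Group tagged values by key, standing in for Hadoop's shuffle/sort."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_phase(key, tagged_values):
    """Pair every left-side value with every right-side value for one key."""
    lefts = [v for tag, v in tagged_values if tag == "L"]
    rights = [v for tag, v in tagged_values if tag == "R"]
    for lval in lefts:
        for rval in rights:
            yield key, lval, rval

def reduce_side_join(left, right):
    pairs = list(map_phase("L", left)) + list(map_phase("R", right))
    groups = shuffle(pairs)
    out = []
    for key in sorted(groups):  # reducers see keys in sorted order
        out.extend(reduce_phase(key, groups[key]))
    return out

# Hypothetical sample data: items and prices sharing an item ID.
items = [("a1", "widget"), ("b2", "gadget")]
prices = [("a1", 9.99), ("c3", 4.50)]
result = reduce_side_join(items, prices)
print(result)  # [('a1', 'widget', 9.99)]
```

Because the reducer only emits rows when both sides are present for a key, this sketch behaves as an inner join.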
