Joining data in the Mapper using MapReduce

Joining data in MapReduce is an expensive operation. Depending on the size of the datasets, you can choose to perform a map-side join or a reduce-side join. In a map-side join, two or more datasets are joined on a key in the map phase of a MapReduce job. In a reduce-side join, the mappers emit each record under its join key, and the reduce phase is responsible for joining the datasets. In this recipe, we will demonstrate how to perform a map-side replicated join using Pig. We will join a weblog dataset with a dataset containing a list of distinct IPs and their associated countries. Because the datasets are joined in the map phase, this will be a map-only job.
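The idea behind a replicated join can be sketched outside Hadoop in a few lines of plain Python. This is only an illustration of the technique, not the recipe's code: the function names, the tab-separated layout, the position of the IP field, and the sample rows are all assumptions made for the example. The small dataset is loaded into an in-memory lookup table, and the large dataset is then streamed past it one record at a time, which is why no reduce phase is needed.

```python
def load_ip_country(lines):
    """Build an in-memory lookup table from the small dataset.

    Assumes each line is "ip<TAB>country" (hypothetical layout).
    """
    table = {}
    for line in lines:
        ip, country = line.rstrip("\n").split("\t")
        table[ip] = country
    return table


def map_side_join(weblog_lines, ip_to_country, ip_field=0):
    """Stream the large dataset and join each record against the table.

    ip_field is the assumed position of the IP column in a weblog record.
    """
    for line in weblog_lines:
        fields = line.rstrip("\n").split("\t")
        country = ip_to_country.get(fields[ip_field])
        if country is not None:  # inner join: skip records with no match
            yield fields + [country]


# Hypothetical sample rows, not taken from the real data files.
ip_country = ["10.0.0.1\tUS", "10.0.0.2\tDE"]
weblog = [
    "10.0.0.1\t/index.html",
    "10.0.0.3\t/about.html",
    "10.0.0.2\t/faq.html",
]

joined = list(map_side_join(weblog, load_ip_country(ip_country)))
```

In a real job, each mapper would hold its own copy of the lookup table, so this approach only suits cases where the smaller dataset fits comfortably in memory.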

Getting ready

Download the apache_nobots_tsv.txt and nobots_ip_country_tsv.txt ...
