Big Data for Chimps by Russell Jurney, Philip Kromer

Chapter 9. Duplicate and Unique Records

This chapter covers the processing of duplicate and unique records. We define duplicate records as two or more records that share the same value in a given field. A unique record is one whose value in a given field is shared by no other record. In each case, we must say which field(s) we mean when we call a record unique or duplicate. Pig is no different: by default, the DISTINCT operator compares all fields, but we can trim fields from a relation to evaluate uniqueness in terms of different fields.
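To make the dependence on the chosen field(s) concrete, here is a minimal sketch in plain Python rather than Pig. The inventory records and the helper names (`distinct`, `project`, `unique_values`) are made up for illustration. They only mimic the idea of comparing whole records versus trimmed records, and they are not how Pig implements DISTINCT.

```python
from collections import Counter

# Hypothetical inventory records: (item_id, label). Invented data for illustration.
records = [
    (1, "hammer"),
    (2, "hammer"),
    (3, "wrench"),
    (1, "hammer"),  # exact copy of the first record
]

def distinct(rows):
    """Keep one copy of each row, comparing every field of the row."""
    return sorted(set(rows))

def project(rows, *indexes):
    """Trim each row down to the chosen fields before comparing."""
    return [tuple(row[i] for i in indexes) for row in rows]

def unique_values(rows):
    """Return rows that occur exactly once, i.e. no other record shares them."""
    counts = Counter(rows)
    return sorted(row for row, n in counts.items() if n == 1)

# Comparing all fields: only the exact copy is dropped.
distinct_records = distinct(records)

# Comparing the label field only: each label appears once.
distinct_labels = distinct(project(records, 1))

# Labels carried by exactly one record.
unique_labels = unique_values(project(records, 1))
```

With this data, comparing all fields leaves three records, comparing only the label leaves two, and only the "wrench" label is unique in the sense defined above.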

We often find ourselves dealing with multiple records for a given concept or entity. At those times, we may want to reduce our data to a single instance of each key. We’ll introduce the operations UNION and DISTINCT, and various DataFu user-defined functions (UDFs) that accomplish this.

We’ll also introduce set operations among relations using Pig, and set operations between data bags using DataFu UDFs.

Handling Duplicates

You will often want to determine the unique set of values in a table or relation (i.e., remove duplicate values and retain only one copy of each record). For instance, if you were creating a set of labels that describe items in an inventory, you would want each label to appear only once in the final output, which you might use for a web page’s autocomplete form.

The DISTINCT operator in Pig performs this operation.

Eliminating Duplicate ...
