Chapter 9. Duplicate and Unique Records

This chapter covers the processing of duplicate and unique records. We define duplicate records as two or more records that share the same value in a given field or set of fields. Unique records are those whose value in that field is shared by no other record. Note that in each case we must say which field(s) we mean when we call a record unique or duplicate. Pig is no different: by default, the DISTINCT operator compares all fields, but we can trim fields from a relation to evaluate uniqueness in terms of different fields.
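As a minimal sketch of the difference between comparing all fields and comparing a trimmed set of fields, consider the Pig Latin below. The file name `visits.tsv` and the fields `user_id` and `page` are hypothetical, chosen only for illustration.

```pig
-- Hypothetical input: one tab-separated line per page visit
visits = LOAD 'visits.tsv' AS (user_id:chararray, page:chararray);

-- DISTINCT compares whole records: two rows are duplicates
-- only if both user_id and page match
unique_visits = DISTINCT visits;

-- Trim the relation to a single field first, so that uniqueness
-- is evaluated in terms of user_id alone
users = FOREACH visits GENERATE user_id;
unique_users = DISTINCT users;

DUMP unique_visits;
DUMP unique_users;
```

A user who visited three different pages appears three times in `unique_visits` but only once in `unique_users`.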

We often find ourselves dealing with multiple records for a given concept or entity. At those times, we may want to reduce our data to a single instance of each key. We’ll introduce the UNION and DISTINCT operations, along with various DataFu user-defined functions (UDFs) that perform this reduction.

We’ll also introduce set operations among relations using Pig, and set operations between data bags using DataFu UDFs.

Handling Duplicates

You will often want to determine the unique set of values in a table or relation (i.e., remove duplicate values and retain only one copy of each record). For instance, if you were creating a set of labels that describe items in an inventory, you would want each label to appear only once in the final output, which you might use for a web page’s autocomplete form.

The DISTINCT operator in Pig performs this operation.
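The inventory example might look something like the following sketch. The input path `inventory.tsv`, its schema, and the output path `unique_labels` are assumptions made for illustration, not names taken from the book's data.

```pig
-- Hypothetical inventory file: one tab-separated line per item
items = LOAD 'inventory.tsv' AS (item_id:long, label:chararray);

-- Keep only the label field, since uniqueness should not depend on item_id
labels = FOREACH items GENERATE label;

-- Remove duplicate labels so each one appears once
unique_labels = DISTINCT labels;

-- Sorting is optional; it is included here as a convenience for an autocomplete list
sorted_labels = ORDER unique_labels BY label ASC;

STORE sorted_labels INTO 'unique_labels';
```

Projecting down to `label` before calling DISTINCT matters: applied to `items` directly, DISTINCT would keep one row per distinct (`item_id`, `label`) pair, so the same label could still appear many times.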

Eliminating ...
