This chapter covers the processing of duplicate and unique records. Duplicate records are two or more records that share the same value in a given field. A record is unique if no other record has the same value in that field. Note that in each case we must specify which field(s) we mean when we say unique or duplicate. Pig is no different: by default, the DISTINCT command uses all fields, but we can trim fields from data relations to evaluate uniqueness in terms of different fields.
We often find ourselves dealing with multiple records for a given concept or entity. At those times, we may want to reduce our data to just one unique instance of each key. We’ll introduce the DISTINCT operation and various DataFu user-defined functions (UDFs) that achieve this.
We’ll also introduce set operations among relations using Pig, and set operations between data bags using DataFu UDFs.
You will often want to determine the unique set of values in a table or relation (i.e., you want to remove duplicate values and retain only unique records). For instance, if you were creating a set of labels that describe items in an inventory, you would only want to see each label once in the final output, which you might use for a web page’s autocomplete form. The DISTINCT operator in Pig performs this operation.
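As a minimal sketch of the inventory-label case, the script below projects a relation down to a single field and then applies DISTINCT. The input file `inventory.tsv`, its two-column schema, and the relation names are hypothetical and chosen only for illustration.

```pig
-- Hypothetical input: a tab-separated file of (item_id, label) pairs.
items = LOAD 'inventory.tsv' AS (item_id:chararray, label:chararray);

-- DISTINCT compares whole records, so project down to the field of interest first.
labels = FOREACH items GENERATE label;

-- Remove duplicate labels; each label appears once in the result.
unique_labels = DISTINCT labels;

STORE unique_labels INTO 'unique_labels';
```

Applying DISTINCT to `items` directly would collapse only those records in which both `item_id` and `label` match, which is why the projection comes first.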