Chapter 4. Introduction to Pig Latin
It is time to dig into Pig Latin. This chapter provides you with the basics of Pig Latin, enough to write your first useful scripts. More advanced features of Pig Latin are covered in Chapter 5.
Preliminary Matters
Pig Latin is a data flow language. Each processing step results in a new dataset, or relation. In input = load 'data', input is the name of the relation that results from loading the dataset data. A relation name is referred to as an alias. Relation names look like variables, but they are not. Once made, an assignment is permanent. It is possible to reuse relation names; for example, this is legitimate:
A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
A = filter A by dividends > 0;
A = foreach A generate UPPER(symbol);
However, it is not recommended. It looks here as if you are reassigning A, but really you are creating new relations called A, and losing track of the old relations called A. Pig is smart enough to keep up, but it still is not a good practice. It leads to confusion when trying to read your programs (which A am I referring to?) and when reading error messages.
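As a minimal sketch of the alternative, the same three steps can be written with one alias per relation. The alias names divs, positive, and symbols are illustrative choices, not names used elsewhere in this chapter, and the script assumes the same NYSE_dividends input as the snippet above.

```pig
-- Same pipeline as the reused-alias version, with a distinct alias per step.
-- divs, positive, and symbols are hypothetical names chosen for this sketch.
divs     = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
positive = filter divs by dividends > 0;
symbols  = foreach positive generate UPPER(symbol);
```

With distinct aliases, each name refers to exactly one relation, so an error message or a later statement that mentions positive can only mean the filtered data.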
In addition to relation names, Pig Latin also has field names. They name a field (or column) in a relation. In the previous snippet of Pig Latin, dividends and symbol are examples of field names. These are somewhat like variables in that they will contain a different value for each record as it passes through the pipeline, but you cannot assign values to ...