Chapter 5. Introduction to Pig Latin
It is time to dig into Pig Latin. This chapter provides you with the basics of Pig Latin, enough to write your first useful scripts. More advanced features of Pig Latin are covered in Chapter 6.
Preliminary Matters
Pig Latin is a dataflow language. Each processing step results in a new data set, or relation. In input = load 'data', input is the name of the relation that results from loading the data set data. A relation name is referred to as an alias. Relation names look like variables, but they are not. Once made, an assignment is permanent. It is possible to reuse relation names; for example, this is legitimate:
A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
A = filter A by dividends > 0;
A = foreach A generate UPPER(symbol);
However, it is not recommended. It looks here as if you are reassigning A, but really you are creating new relations called A, losing track of the old relations called A. Pig is smart enough to keep up, but it still is not a good practice. It leads to confusion when trying to read your programs (which A am I referring to?) and when reading error messages.
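As a minimal sketch of the alternative, the same pipeline can be written with a distinct alias for each step. The alias names (divs, positive, symbols) and the declared field types are assumptions made for illustration; only the file name and the field names come from the snippet above.

```pig
-- Load the data with an explicit schema; the types are assumed for this sketch.
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                                 date:chararray, dividends:float);

-- Keep only records that actually paid a dividend.
positive = filter divs by dividends > 0;

-- Emit the uppercased stock symbol for each remaining record.
symbols = foreach positive generate UPPER(symbol);

dump symbols;
```

Because every alias is assigned only once, each later statement and any error message that names an alias points to exactly one relation.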
In addition to relation names, Pig Latin also has field names. They name a field (or column) in a relation. In the previous snippet of Pig Latin, dividends and symbol are examples of field names. These are somewhat like variables in that they will contain a different value for each record as it passes through the pipeline, but you cannot assign values to them. ...