Before we take a look at the operators that Pig Latin provides, we first need to understand Pig’s data model. This includes Pig’s data types, how it handles concepts such as missing data, and how you can describe your data to Pig.
Pig’s data types can be divided into two categories: scalar types, which contain a single value, and complex types, which contain other types.
Pig’s scalar types are simple types that appear in most
programming languages. With the exception of bytearray, they are all
represented in Pig interfaces by java.lang
classes, making them easy to work with in UDFs:
- int: An integer. Ints are represented in interfaces by java.lang.Integer. They store a four-byte signed integer. Constant integers are expressed as integer numbers, for example, 42.
- long: A long integer. Longs are represented in interfaces by java.lang.Long. They store an eight-byte signed integer. Constant longs are expressed as integer numbers with an L appended, for example, 5000000000L.
- float: A floating-point number. Floats are represented in interfaces by java.lang.Float and use four bytes to store their value. You can find the range of values representable by Java’s Float type at http://java.sun.com/docs/books/jls/third_edition/html/typesValues.html#4.2.3. Note that because this is a floating-point number, in some calculations it will lose precision. For calculations that require no loss of precision, you should use an int or long instead. Constant floats are expressed as a floating-point number with an f appended, in either simple format, 3.14f, or exponent format, 6.022e23f.
- double: A double-precision floating-point number. Doubles are represented in interfaces by java.lang.Double and use eight bytes to store their value. You can find the range of values representable by Java’s Double type at http://java.sun.com/docs/books/jls/third_edition/html/typesValues.html#4.2.3. Note that because this is a floating-point number, in some calculations it will lose precision. For calculations that require no loss of precision, you should use an int or long instead. Constant doubles are expressed as a floating-point number in either simple format, 2.71828, or exponent format, 6.626e-34.
- chararray: A string or character array. Chararrays are represented in interfaces by java.lang.String. Constant chararrays are expressed as string literals with single quotes, for example, 'fred'. In addition to standard alphanumeric and symbolic characters, you can express certain characters in chararrays by using backslash codes, such as \t for Tab and \n for Newline. Unicode characters can be expressed as \u followed by their four-digit hexadecimal Unicode value. For example, the value for Ctrl-A is expressed as \u0001.
- bytearray: A blob or array of bytes. Bytearrays are represented in interfaces by a Java class, DataByteArray, that wraps a Java byte[]. There is no way to specify a constant bytearray.
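The constant formats above can be sketched in a single statement (the file name and input field here are hypothetical, just to anchor the foreach):

```pig
-- a hypothetical one-column input
a = load 'some_input' as (x:int);
-- one constant of each scalar type: int, long, float, double, chararray
b = foreach a generate 42, 5000000000L, 3.14f, 6.626e-34, 'fred';
```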
Pig has three complex data types: maps, tuples, and bags. All of these types can contain data of any type, including other complex types. So it is possible to have a map where the value field is a bag, which contains a tuple where one of the fields is a map.
A map in Pig is a mapping from chararray keys to data elements, where each element can be any Pig type, including a complex type. The chararray is called the key and is used as an index to find the element, referred to as the value.
Because Pig does not know the type of the value, it will assume it is a bytearray. However, the actual value might be something different. If you know what the actual type is (or what you want it to be), you can cast it; see Casts. If you do not cast the value, Pig will make a best guess based on how you use the value in your script. If the value is of a type other than bytearray, Pig will figure that out at runtime and handle it. See Schemas for more information on how Pig handles unknown types.
By default there is no requirement that all values in a map be of the same type. It is legitimate to have a map with two keys, name and age, where the value for name is a chararray and the value for age is an int. Beginning in Pig 0.9, a map can declare all of its values to be of the same type. This is useful if you know all values in the map will be of the same type, as it allows you to avoid casting, and Pig can avoid the runtime type-massaging referenced in the previous paragraph.
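As a sketch, assuming Pig 0.9 or later and a hypothetical input file, a map whose values are all declared to be ints would look like:

```pig
-- map[int] declares every value in the stats map to be an int,
-- so no cast is needed when a value is used as a number
players = load 'player_stats' as (name:chararray, stats:map[int]);
walks   = foreach players generate name, stats#'base_on_balls';
```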
Map constants are formed using brackets to
delimit the map, a hash between keys and
values, and a comma between key-value pairs. For example,
['name'#'bob', 'age'#55]
will
create a map with two keys, “name” and
“age”. The first value is a chararray, and the second is
an integer.
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type—they do not all need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL columns. Because tuples are ordered, it is possible to refer to the fields by position; see Expressions in foreach for details. A tuple can, but is not required to, have a schema associated with it that describes each field’s type and provides a name for each field. This allows Pig to check that the data in the tuple is what the user expects, and it allows the user to reference the fields of the tuple by name.
Tuple constants use parentheses to indicate
the tuple and commas to delimit fields in the tuple. For example,
('bob', 55)
describes a tuple
constant with two fields.
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema associated with it. In the case of a bag, the schema describes all tuples within the bag.
Bag constants are constructed using braces,
with tuples in the bag separated by commas. For example, {('bob', 55), ('sally', 52), ('john', 25)}
constructs a bag with three tuples, each with two fields.
Pig users often notice that Pig does not provide a list or set type that can store items of any type. It is possible to mimic a set type using the bag, by wrapping the desired type in a tuple of one field. For instance, if you want to store a set of integers, you can create a bag with a tuple with one field, which is an int. This is a bit cumbersome, but it works.
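A minimal sketch of this idiom, with hypothetical names:

```pig
-- a bag of one-field tuples standing in for a set of ints;
-- the bag constant {(1), (2), (3)} has the same shape
nums = load 'numbers' as (vals:bag{t:(n:int)});
```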
Bag is the one type in Pig that is not required to fit into memory. As you will see later, because bags are used to store collections when grouping, bags can become quite large. Pig has the ability to spill bags to disk when necessary, keeping only partial sections of the bag in memory. The size of the bag is limited to the amount of local disk available for spilling the bag.
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it. See foreach, Group, and Join for details of how nulls are handled in expressions and relations in Pig.
Unlike SQL, Pig does not have a notion of constraints on the data. In the context of nulls, this means that any data element can always be null. As you write Pig Latin scripts and UDFs, you will need to keep this in mind.
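Because any field can be null, scripts often guard against it explicitly before operating on a value; for example:

```pig
-- filter out records where dividend is unknown before using it
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend:float);
good = filter divs by dividend is not null;
```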
Pig has a very lax attitude when it comes to schemas. This is a consequence of Pig’s philosophy of eating anything. If a schema for the data is available, Pig will make use of it, both for up-front error checking and for optimization. But if no schema is available, Pig will still process the data, making the best guesses it can based on how the script treats the data. First, we will look at ways that you can communicate the schema to Pig; then, we will examine how Pig handles the case where you do not provide it with the schema.
The easiest way to communicate the schema of your data to Pig is to explicitly tell Pig what it is when you load the data:
dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
Pig now expects your data to have four fields. If it has more, it will truncate the extra ones. If it has fewer, it will pad the end of the record with nulls.
It is also possible to specify the schema without giving explicit data types. In this case, the data type is assumed to be bytearray:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
Warning
You would expect that this also would force your data into a tuple with four fields, regardless of the number of actual input fields, just like when you specify both names and types for the fields. And in Pig 0.9 this is what happens. But in 0.8 and earlier versions it does not; no truncation or null padding is done in the case where you do not provide explicit types for the fields.
Also, when you declare a schema, you do not have to declare the schema of complex types, but you can if you want to. For example, if your data has a tuple in it, you can declare that field to be a tuple without specifying the fields it contains. You can also declare that field to be a tuple that has three columns, all of which are integers. Table 4-1 gives the details of how to specify each data type inside a schema declaration.
Table 4-1. Schema syntax

| Data type | Syntax | Example |
|---|---|---|
| int | int | as (a:int) |
| long | long | as (a:long) |
| float | float | as (a:float) |
| double | double | as (a:double) |
| chararray | chararray | as (a:chararray) |
| bytearray | bytearray | as (a:bytearray) |
| map | map[] or map[type], where type is any valid type; this declares all values in the map to be of this type | as (a:map[]) |
| tuple | tuple() or tuple(list of fields), where list of fields is a comma-separated list of field declarations | as (a:tuple(x:int, y:int)) |
| bag | bag{} or bag{t:(list of fields)}, where list of fields is a comma-separated list of field declarations | as (a:bag{t:(x:int, y:int)}) |
The runtime declaration of schemas is very nice. It makes it easy for users to operate on data without having to first load it into a metadata system. It also means that if you are interested in only the first few fields, you only have to declare those fields.
But for production systems that run over the same data every hour or every day, it has a couple of significant drawbacks. One, whenever your data changes, you have to change your Pig Latin. Two, although this works fine on data with 5 columns, it is painful when your data has 100 columns. To address these issues, there is another way to load schemas in Pig.
If the load function you are using already knows
the schema of the data, the function can communicate that to Pig. (Load
functions are how Pig reads data; see Load for
details.) Load functions might already know the schema because it is
stored in a metadata repository such as HCatalog, or it might be stored in
the data itself (if, for example, the data is stored in JSON format). In this case, you do not have to declare the
schema as part of the load
statement. And you can still refer
to fields by name because Pig will fetch the schema from the load function
before doing error checking on your script:
mdata = load 'mydata' using HCatLoader();
cleansed = filter mdata by name is not null;
...
But what happens when you cross the streams? What if you specify a schema and the loader returns one? If they are identical, all is well. If they are not identical, Pig will determine whether it can adapt the one returned by the loader to match the one you gave. For example, if you specified a field as a long and the loader said it was an int, Pig can and will do that cast. However, if it cannot determine a way to make the loader’s schema fit the one you gave, it will give an error. See Casts for a list of casts Pig can and cannot insert to make the schemas work together.
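For example, if a loader reports a field as an int but the script declares it as a long, Pig widens it for you (HCatLoader and the column names here are illustrative):

```pig
-- suppose the metadata repository stores 'amount' as an int; declaring it
-- long makes Pig insert an int-to-long cast, which is always safe
divs = load 'dividends' using HCatLoader() as (symbol:chararray, amount:long);
```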
Now let’s look at the case where neither you nor
the load function tell Pig what the data’s schema is. In addition to being
referenced by name, fields can be referenced by position, starting from
zero. The syntax is a dollar sign, then the position: $0
refers to the first field. So it is easy to
tell Pig which field you want to work with. But how does Pig know the data
type? It does not, so it starts by assuming everything is a bytearray.
Then it looks at how you use those fields in your script, drawing
conclusions about what you think those fields are and how you want to use
them. Consider the following:
--no_schema.pig
daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0, SUBSTRING($0, 0, 1), $6 - $3;
In the expression $7 / 1000, 1000 is an integer, so it is a safe guess that the eighth field of NYSE_daily is an integer or something that can be cast to an integer. In the same way, $3 * 100.0 indicates $3 is a double, and the use of $0 in a function that takes a chararray as an argument indicates the type of $0. But what about the last expression, $6 - $3? The - operator is used only with numeric types in Pig, so Pig can safely guess that $3 and $6 are numeric. But should it treat them as integers or floating-point numbers? Here Pig plays it safe and guesses that they are floating-point numbers, casting them to doubles. This is the safer bet because if they actually are integers, those can be represented as floating-point numbers, but the reverse is not true. However, because floating-point arithmetic is much slower and subject to loss of precision, if these values really are integers, you should cast them so that Pig uses integer types in this case.
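A sketch of that explicit cast, assuming the fourth and seventh fields really hold integers:

```pig
daily = load 'NYSE_daily';
-- casting both fields to int keeps the subtraction in integer arithmetic
calcs = foreach daily generate (int)$6 - (int)$3;
```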
There are also cases where Pig cannot make any intelligent guess:
--no_schema_filter
daily = load 'NYSE_daily';
fltrd = filter daily by $6 > $3;
The > operator is valid on numeric, chararray, and bytearray types in Pig Latin, so Pig has no way to make a guess. In this case, it treats these fields as if they were bytearrays, which means it will do a byte-to-byte comparison of the data in these fields.
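If a numeric comparison is what you want, an explicit cast resolves the ambiguity:

```pig
daily = load 'NYSE_daily';
-- without the casts this would be a byte-to-byte comparison
fltrd = filter daily by (float)$6 > (float)$3;
```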
Pig also has to handle the case where it guesses wrong and must adapt on the fly. Consider the following:
--unintended_walks.pig
player = load 'baseball' as (name:chararray, team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';
Because the values in maps can be of any type, Pig has no idea what type bat#'base_on_balls' and bat#'ibbs' are. By the rules laid out previously, Pig will assume they are doubles. But let’s say they actually turn out to be represented internally as integers.[5] In that case, Pig will need to adapt at runtime and convert what it thought was a cast from bytearray to double into a cast from int to double. Note that it will still produce a double output and not an int output. This might seem nonintuitive; see How Strongly Typed Is Pig? for details on why this is so. It should be noted that in Pig 0.8 and earlier, much of this runtime adaptation code was shaky and often failed. In 0.9, much of this has been fixed. But if you are using an older version of Pig, you might need to cast the data explicitly to get the right results.
Finally, Pig’s knowledge of the schema can change at different points in the Pig Latin script. In all of the previous examples where we loaded data without a schema and then passed it to a foreach statement, the data started out without a schema. But after the foreach, the schema is known. Similarly, Pig can start out knowing the schema, but if the data is mingled with other data without a schema, the schema can be lost. That is, lack of schema is contagious:
--no_schema_join.pig
divs = load 'NYSE_dividends' as (exchange, stock_symbol, date, dividends);
daily = load 'NYSE_daily';
jnd = join divs by stock_symbol, daily by $1;
In this example, because Pig does not know the schema of daily, it cannot know the schema of the join of divs and daily.
The previous sections have referenced casts in Pig without bothering to define how casts work. The syntax for casts in Pig is the same as in Java—the type name in parentheses before the value:
--unintended_walks_cast.pig
player = load 'baseball' as (name:chararray, team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate (int)bat#'base_on_balls' - (int)bat#'ibbs';
The syntax for specifying types in casts is exactly the same as specifying them in schemas, as shown previously in Table 4-1.
Not all conceivable casts are allowed in Pig. Table 4-2 describes which casts are allowed between scalar types. Casts to bytearrays are never allowed because Pig does not know how to represent the various data types in binary format. Casts from bytearrays to any type are allowed. Casts to and from complex types currently are not allowed, except from bytearray, although conceptually in some cases they could be.
Table 4-2. Supported casts
| | To int | To long | To float | To double | To chararray |
|---|---|---|---|---|---|
| From int | | Yes. | Yes. | Yes. | Yes. |
| From long | Yes. Any values greater than 2^31 or less than -2^31 will be truncated. | | Yes. | Yes. | Yes. |
| From float | Yes. Values will be truncated to int values. | Yes. Values will be truncated to long values. | | Yes. | Yes. |
| From double | Yes. Values will be truncated to int values. | Yes. Values will be truncated to long values. | Yes. Values with precision beyond what float can represent will be truncated. | | Yes. |
| From chararray | Yes. Chararrays with nonnumeric characters result in null. | Yes. Chararrays with nonnumeric characters result in null. | Yes. Chararrays with nonnumeric characters result in null. | Yes. Chararrays with nonnumeric characters result in null. | |
One type of casting that requires special treatment is casting from bytearray to other types. Because bytearray indicates a string of bytes, Pig does not know how to convert its contents to any other type. Continuing the previous example, both bat#'base_on_balls' and bat#'ibbs' were loaded as bytearrays. The casts in the script indicate that you want them treated as ints.
Pig does not know whether integer values in baseball are stored as ASCII strings, Java serialized values, binary-coded decimal, or some other format. So it asks the load function, because it is that function’s responsibility to cast bytearrays to other types. In general this works nicely, but it does lead to a few corner cases where Pig does not know how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know how to perform casts on it because that bytearray is not generated by a load function.
Before leaving the topic of casts, we need to consider cases where Pig inserts casts for the user. These casts are implicit, compared to explicit casts where the user indicates the cast. Consider the following:
--total_trade_estimate.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
rough = foreach daily generate volume * close;
In this case, Pig will change the second line to
(float)volume * close
to do the
operation without losing precision. In general, Pig will always widen
types to fit when it needs to insert these implicit casts. So, int and
long together will result in a long; int or long and float will result
in a float; and int, long, or float and double will result in a double.
There are no implicit casts between numeric types and chararrays or
other types.
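These widening rules can be sketched with hypothetical field names:

```pig
a = load 'input' as (i:int, l:long, f:float, d:double);
-- i * l is widened to long, l * f to float, and f * d to double
b = foreach a generate i * l, l * f, f * d;
```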
[5] That is not the case in the example data. For that to be the
case, you would need to use a loader that did load the bat
map with these values as
integers.