This appendix covers UDFs that come as part of the Pig distribution, including built-in UDFs and user-contributed UDFs in Piggybank.

Pig comes prepackaged with many UDFs that can be used directly in Pig without using `register` or `define`. These include load, store, evaluation, and filter functions.

Pig’s built-in load functions are listed in Table A-1; Table A-2 lists the store functions.

Table A-1. Load functions

Function | Location string indicates | Constructor arguments | Description |
---|---|---|---|
`HBaseStorage` | HBase table | The first argument is a string describing column family and column to Pig field mapping. | Load data from HBase (see HBase). |
`PigStorage` | HDFS file | The first argument is a field separator (optional; defaults to Tab). | Load text data from HDFS (see Load). |
`TextLoader` | HDFS file | None. | Reads lines of text, each line as a tuple with one chararray field. |

Table A-2. Store functions

Function | Location string indicates | Constructor arguments | Description |
---|---|---|---|
`HBaseStorage` | HBase table | The first argument is a string describing Pig field to HBase column family and column mapping. | Store data to HBase (see HBase). |
`PigStorage` | HDFS file | The first argument is a field separator (optional; defaults to Tab). | Store text to HDFS in text format (see Store). |
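As a rough illustration of the constructor arguments in Tables A-1 and A-2, the following sketch loads with an explicit separator, loads raw lines, and stores with the default separator. The paths, relation names, and field names are hypothetical and not taken from the book.

```pig
-- Hypothetical paths and field names; adjust them to your own data.

-- PigStorage with an explicit comma separator instead of the default tab.
daily = load 'input/daily.csv' using PigStorage(',')
        as (exchange:chararray, symbol:chararray, volume:int);

-- TextLoader takes no constructor arguments; each line becomes a
-- tuple with a single chararray field.
lines = load 'input/notes.txt' using TextLoader() as (line:chararray);

-- PigStorage with no argument uses the default tab separator.
store daily into 'output/daily_tab' using PigStorage();
```

`HBaseStorage` is left out of the sketch because its mapping string depends on the layout of the HBase table being read or written.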

The evaluation functions can be divided into math functions that mimic many of the Java math functions; aggregate functions that take a bag of values and produce a single result; functions that operate on or produce complex types; chararray and bytearray functions; filter functions; and miscellaneous functions.

Each of the built-in evaluation and filter functions is discussed in the following lists. In these lists, for brevity, a bag of tuples with a given type is specified by braces surrounding parentheses and a list of the tuples’ fields. For example, a bag of tuples with one integer field is denoted as `{(int)}`.
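To make the bag notation concrete, here is a minimal sketch of where a `{(int)}` argument typically comes from: grouping a relation and projecting a single field out of the resulting bag. The input path and field names are assumptions made for illustration.

```pig
-- Hypothetical input: tab-separated lines of (symbol, volume).
trades  = load 'input/trades' as (symbol:chararray, volume:int);

-- After grouping, each record holds the key and a bag of trade tuples.
grouped = group trades by symbol;

-- trades.volume projects that bag down to one int field, i.e. {(int)},
-- which is the argument shape the aggregate functions expect.
totals  = foreach grouped generate
              group,
              SUM(trades.volume),
              MAX(trades.volume),
              COUNT(trades);

dump totals;
```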

`double ABS(double input)`

`double ACOS(double input)`

`double ASIN(double input)`

`double ATAN(double input)`

`double CBRT(double input)`

`double CEIL(double input)`

`double COS(double input)`

`double COSH(double input)`

`double EXP(double input)`

`double FLOOR(double input)`

`double LOG(double input)`

`double LOG10(double input)`

`long ROUND(double input)`

`double SIN(double input)`

`double SINH(double input)`

`double SQRT(double input)`

`double TAN(double input)`

`double TANH(double input)`

`int AVG({(int)} input)`

`long AVG({(long)} input)`

`float AVG({(float)} input)`

`double AVG({(double)} input)`

`double AVG({(bytearray)} input)`

`long COUNT`

A version of `COUNT` that matches SQL semantics for `COUNT(col)`

`long COUNT_STAR`

`int MAX({(int)} input)`

`long MAX({(long)} input)`

`float MAX({(float)} input)`

`double MAX({(double)} input)`

`chararray MAX`

`double MAX({(bytearray)} input)`

`int MIN({(int)} input)`

`long MIN({(long)} input)`

`float MIN({(float)} input)`

`double MIN({(double)} input)`

`chararray MIN`

`double MIN({(bytearray)} input)`

`long SUM({(int)} input)`

`long SUM({(long)} input)`

`double SUM({(float)} input)`

`double SUM({(double)} input)`

`double SUM({(bytearray)} input)`

`chararray CONCAT(chararray c1, chararray c2)`

`bytearray CONCAT(bytearray b1, bytearray b2)`

`int INDEXOF(chararray source, chararray search)`

`int LAST_INDEX_OF(chararray source, chararray search)`

`chararray LCFIRST(chararray input)`

`chararray LOWER(chararray input)`

`chararray REGEX_EXTRACT(chararray source, chararray regex, int n)`

`(chararray) REGEX_EXTRACT_ALL(chararray source, chararray regex)`

`chararray REPLACE(chararray source, chararray toReplace, chararray newValue)`

`long SIZE(chararray input)`

`long SIZE(bytearray input)`

`(chararray) STRSPLIT(chararray source)`

`(chararray) STRSPLIT(chararray source, chararray regex)`

`(chararray) STRSPLIT(chararray source, chararray regex, int maxsplits)`

`chararray SUBSTRING(chararray source, int start, int end)`

`{(chararray)} TOKENIZE(chararray input)`

`chararray TRIM(chararray input)`

`chararray UCFIRST(chararray input)`

`chararray UPPER(chararray input)`

`{(chararray, chararray, double)} COR({(double)} b1, {(double)} b2)`

`{(chararray, chararray, double)} COV({(double)} b1, {(double)} b2)`

`bag DIFF(bag b1, bag b2)`

`long SIZE(map input)`

`long SIZE(tuple input)`

`long SIZE(bag input)`

`bag TOBAG(...)`

`map TOMAP(...)`

`bag TOP(int numRecords, int field, bag source)`

`tuple TOTUPLE(...)`
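As a sketch of how the functions listed above are called, the following script uses a few of them directly in a `foreach`, with no `register` or `define` statement. The input path and field names are hypothetical; `describe` is used only to show the resulting schema.

```pig
-- Hypothetical input: tab-separated lines of (name, price).
items  = load 'input/items' as (name:chararray, price:double);

-- Built-in functions are available without any register or define.
shaped = foreach items generate
             UPPER(name)          as name_upper,    -- chararray
             SIZE(name)           as name_length,   -- long
             ROUND(price)         as price_rounded, -- long
             TOTUPLE(name, price) as pair;          -- tuple

describe shaped;
```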

*Piggybank* is Pig’s repository of user-contributed functions. Piggybank functions are distributed as part of the Pig distribution, but they are not built in. To use them, you must `register` the Piggybank JAR, which you can find in your distribution at *contrib/piggybank/java/piggybank.jar*.
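A minimal sketch of registering the JAR and calling a Piggybank function follows. It assumes the script is run from the root of a Pig distribution and that the Piggybank build includes the string UDF `org.apache.pig.piggybank.evaluation.string.Reverse`. The input path, relation names, and the `reverse` alias are hypothetical.

```pig
-- Path is relative to the root of a Pig distribution; adjust as needed.
register 'contrib/piggybank/java/piggybank.jar';

-- Optional: define a short alias for the fully qualified class name.
define reverse org.apache.pig.piggybank.evaluation.string.Reverse();

-- Hypothetical input: one name per line.
names    = load 'input/names' as (name:chararray);
backward = foreach names generate reverse(name);

dump backward;
```

Without the `define`, the function can still be called by its fully qualified class name once the JAR is registered.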

At the time of writing, there is no central website
or set of documentation for Piggybank. To find out what is in there, you
will need to browse through the code. You can see all of the included
functions by looking in your distribution under *contrib/piggybank/*. Piggybank does not yet
include any Python functions, but it is set up to allow users to
contribute functions in languages other than Java, so hopefully this will
change in time.
