Appendix A. Built-in User Defined Functions and Piggybank
This appendix covers UDFs that come as part of the
Pig distribution, including built-in UDFs and user-contributed UDFs in
Piggybank.
Pig comes prepackaged with many UDFs that can be used directly in Pig without using register or define. These include load, store, evaluation, and filter functions.
Built-in Load and Store Functions
Pig’s built-in load functions are listed in Table A-1; Table A-2 lists the
store functions.
Table A-1. Load functions
Function | Location string indicates | Constructor arguments | Description |
---|---|---|---|
HBaseStorage | HBase table | The first argument is a string describing the column family and column to Pig field mapping. The second is an option string (optional). | Load data from HBase (see HBase). |
PigStorage | HDFS file | The first argument is a field separator (optional; defaults to Tab). | Load text data from HDFS (see Load). |
TextLoader | HDFS file | None. | Reads lines of text, each line as a tuple with one chararray field. |
Table A-2. Store functions
Function | Location string indicates | Constructor arguments | Description |
---|---|---|---|
HBaseStorage | HBase table | The first argument is a string describing the Pig field to HBase column family and column mapping. The second is an option string (optional). | Store data to HBase (see HBase). |
PigStorage | HDFS file | The first argument is a field separator (optional; defaults to Tab). | Store data to HDFS in text format (see Store). |
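As a rough illustration of how the functions in Tables A-1 and A-2 are invoked, the following sketch loads comma-separated and raw text data and stores a result. The paths and field names are hypothetical; only the function names and constructor arguments come from the tables.

```pig
-- Paths and field names below are made up for illustration.
-- PigStorage with an explicit separator; omit the argument to split on tabs.
players = load 'input/players.csv' using PigStorage(',')
          as (name:chararray, team:chararray, hits:int);

-- TextLoader takes no constructor arguments and yields one chararray per line.
raw = load 'input/server.log' using TextLoader() as (line:chararray);

-- Store as tab-separated text, the PigStorage default.
store players into 'output/players_tab' using PigStorage();
```

Because these functions are built in, no register or define statement is needed before using them.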
Built-in Evaluation and Filter Functions
The evaluation functions can be divided into math
functions that mimic many of the Java math functions; aggregate functions that take a bag of values and produce
a single result; functions that operate on or produce complex types; chararray and bytearray functions; filter
functions; and miscellaneous functions.
Each of the built-in evaluation and filter functions is discussed in the following lists. In these lists, for brevity, a bag of tuples with a given type is specified by braces surrounding parentheses and a list of the tuples’ fields. For example, a bag of tuples with one integer field is denoted as {(int)}.

double ABS(double input)
- Parameter: input
- Returns: Absolute value
- Since version: 0.8

double ACOS(double input)
- Parameter: input
- Returns: Arc cosine
- Since version: 0.8

double ASIN(double input)
- Parameter: input
- Returns: Arc sine
- Since version: 0.8

double ATAN(double input)
- Parameter: input
- Returns: Arc tangent
- Since version: 0.8

double CBRT(double input)
- Parameter: input
- Returns: Cube root
- Since version: 0.8

double CEIL(double input)
- Parameter: input
- Returns: Next-highest double value that is a mathematical integer
- Since version: 0.8

double COS(double input)
- Parameter: input
- Returns: Cosine
- Since version: 0.8

double COSH(double input)
- Parameter: input
- Returns: Hyperbolic cosine
- Since version: 0.8

double EXP(double input)
- Parameter: input
- Returns: Euler’s number (e) raised to the power of input
- Since version: 0.8

double FLOOR(double input)
- Parameter: input
- Returns: Next-lowest double value that is a mathematical integer
- Since version: 0.8

double LOG(double input)
- Parameter: input
- Returns: Natural logarithm of input
- Since version: 0.8

double LOG10(double input)
- Parameter: input
- Returns: Logarithm base 10 of input
- Since version: 0.8

long ROUND(double input)
- Parameter: input
- Returns: Long nearest to the value of input
- Since version: 0.8

double SIN(double input)
- Parameter: input
- Returns: Sine
- Since version: 0.8

double SINH(double input)
- Parameter: input
- Returns: Hyperbolic sine
- Since version: 0.8

double SQRT(double input)
- Parameter: input
- Returns: Square root
- Since version: 0.8

double TAN(double input)
- Parameter: input
- Returns: Tangent
- Since version: 0.8

double TANH(double input)
- Parameter: input
- Returns: Hyperbolic tangent
- Since version: 0.8
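The math functions above are applied per record, typically inside a foreach. A minimal sketch, assuming a hypothetical input file with one double field named x:

```pig
-- 'input/readings.txt' and the field name x are hypothetical.
readings = load 'input/readings.txt' as (x:double);

shaped = foreach readings generate
             x,
             ABS(x)        as magnitude,  -- double
             SQRT(ABS(x))  as root,       -- double
             CEIL(x)       as up,         -- double that is a mathematical integer
             FLOOR(x)      as down,       -- double that is a mathematical integer
             ROUND(x)      as nearest;    -- long

dump shaped;
```

Note that ROUND is the only function in this group listed as returning a long; the others return doubles.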

int AVG({(int)} input)
- Parameter: input
- Returns: Average of all values in input; nulls are ignored
- Since version: 0.2

long AVG({(long)} input)
- Parameter: input
- Returns: Average of all values in input; nulls are ignored
- Since version: 0.2

float AVG({(float)} input)
- Parameter: input
- Returns: Average of all values in input; nulls are ignored
- Since version: 0.2

double AVG({(double)} input)
- Parameter: input
- Returns: Average of all values in input; nulls are ignored
- Since version: 0.2

double AVG({(bytearray)} input)
- Parameter: input
- Returns: Average of all bytearrays, cast to doubles, in input; nulls are ignored
- Since version: 0.1

long COUNT
A version of COUNT that matches SQL semantics for COUNT(col)
- Parameter: input
- Returns: Number of records in input, excluding null values
- Since version: 0.1

long COUNT_STAR
A version of COUNT that matches SQL semantics for COUNT(*)
- Parameter: input
- Returns: Number of all records in input, including null values
- Since version: 0.4
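The difference between COUNT and COUNT_STAR only shows up when the data contains nulls. The sketch below, which assumes a hypothetical users file where some records have no age, computes both counts over the whole relation:

```pig
-- 'input/users.txt' and its fields are hypothetical; some ages are assumed to be null.
users = load 'input/users.txt' as (name:chararray, age:int);

-- Collect every record into a single bag so the aggregates see all of them.
grpd = group users all;

counts = foreach grpd generate
             COUNT(users.age)  as with_age,  -- skips records whose age is null
             COUNT_STAR(users) as total;     -- counts every record

dump counts;
```

If no ages are null, the two values are equal.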

int MAX({(int)} input)
- Parameter: input
- Returns: Maximum value in input; nulls are ignored
- Since version: 0.2

long MAX({(long)} input)
- Parameter: input
- Returns: Maximum value in input; nulls are ignored
- Since version: 0.2

float MAX({(float)} input)
- Parameter: input
- Returns: Maximum value in input; nulls are ignored
- Since version: 0.2

double MAX({(double)} input)
- Parameter: input
- Returns: Maximum value in input; nulls are ignored
- Since version: 0.2

chararray MAX
- Parameter: input
- Returns: Maximum value in input; nulls are ignored
- Since version: 0.2

double MAX({(bytearray)} input)
- Parameter: input
- Returns: Maximum of all bytearrays, cast to doubles, in input; nulls are ignored
- Since version: 0.1

int MIN({(int)} input)
- Parameter: input
- Returns: Minimum value in input; nulls are ignored
- Since version: 0.2

long MIN({(long)} input)
- Parameter: input
- Returns: Minimum value in input; nulls are ignored
- Since version: 0.2

float MIN({(float)} input)
- Parameter: input
- Returns: Minimum value in input; nulls are ignored
- Since version: 0.2

double MIN({(double)} input)
- Parameter: input
- Returns: Minimum value in input; nulls are ignored
- Since version: 0.2

chararray MIN
- Parameter: input
- Returns: Minimum value in input; nulls are ignored
- Since version: 0.2

double MIN({(bytearray)} input)
- Parameter: input
- Returns: Minimum of all bytearrays, cast to doubles, in input; nulls are ignored
- Since version: 0.1

long SUM({(int)} input)
- Parameter: input
- Returns: Sum of all values in the bag; nulls are ignored
- Since version: 0.2

long SUM({(long)} input)
- Parameter: input
- Returns: Sum of all values in the bag; nulls are ignored
- Since version: 0.2

double SUM({(float)} input)
- Parameter: input
- Returns: Sum of all values in the bag; nulls are ignored
- Since version: 0.2

double SUM({(double)} input)
- Parameter: input
- Returns: Sum of all values in the bag; nulls are ignored
- Since version: 0.2

double SUM({(bytearray)} input)
- Parameter: input
- Returns: Sum of all bytearrays, cast to doubles, in input; nulls are ignored
- Since version: 0.1
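Since the aggregate functions take a bag and produce a single value, they are normally applied to the bags produced by group. A minimal sketch, assuming a hypothetical sales file with a shop name and an amount per record:

```pig
-- 'input/sales.txt' and its fields are hypothetical.
sales = load 'input/sales.txt' as (shop:chararray, amount:double);

-- Each output record of group holds the key and a bag of matching sales records.
by_shop = group sales by shop;

summary = foreach by_shop generate
              group              as shop,
              SUM(sales.amount)  as total,
              AVG(sales.amount)  as mean,
              MIN(sales.amount)  as smallest,
              MAX(sales.amount)  as largest;

dump summary;
```

The projection sales.amount yields a bag of single-field tuples, which matches the {(double)} input shape used in the signatures above.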
Built-in chararray and bytearray UDFs
chararray CONCAT(chararray c1, chararray c2)
- Parameters: c1, c2
- Returns: Concatenation of c1 and c2
- Since version: 0.1

bytearray CONCAT(bytearray b1, bytearray b2)
- Parameters: b1, b2
- Returns: Concatenation of b1 and b2
- Since version: 0.1

int INDEXOF(chararray source, chararray search)
- Parameters:
  - source: the chararray to search in
  - search: the chararray to search for
- Returns: Index of the first instance of search in source; -1 if search is not in source
- Since version: 0.8

int LAST_INDEX_OF(chararray source, chararray search)
- Parameters:
  - source: the chararray to search in
  - search: the chararray to search for
- Returns: Index of the last instance of search in source; -1 if search is not in source
- Since version: 0.8

chararray LCFIRST(chararray input)
- Parameter: input
- Returns: input, with the first character converted to lowercase
- Since version: 0.8

chararray LOWER(chararray input)
- Parameter: input
- Returns: input, with all characters converted to lowercase
- Since version: 0.8

chararray REGEX_EXTRACT(chararray source, chararray regex, int n)
- Parameters:
  - source: the chararray to search in
  - regex: the regular expression to search for
  - n: take the nth match, counting from 0
- Returns: nth subset of source matching regex; null if there are no matches
- Since version: 0.8

(chararray) REGEX_EXTRACT_ALL(chararray source, chararray regex)
- Parameters:
  - source: the chararray to search in
  - regex: the regular expression to search for
- Returns: Tuple containing all subsets of source matching regex; null if there are no matches
- Since version: 0.8

chararray REPLACE(chararray source, chararray toReplace, chararray newValue)
- Parameters:
  - source: the chararray to search in
  - toReplace: the chararray to be replaced
  - newValue: the new chararray to replace it with
- Returns: source, with all instances of toReplace changed to newValue
- Since version: 0.8

long SIZE(chararray input)
- Parameter: input
- Returns: Number of characters in input
- Since version: 0.2

long SIZE(bytearray input)
- Parameter: input
- Returns: Number of bytes in input
- Since version: 0.2

(chararray) STRSPLIT(chararray source)
Split a chararray by whitespace
- Parameter:
  - source: the chararray to split
- Returns: Tuple with one field for each section of source
- Since version: 0.8

(chararray) STRSPLIT(chararray source, chararray regex)
Split a chararray by a regular expression
- Parameters:
  - source: the chararray to split
  - regex: the regular expression to use as the delimiter
- Returns: Tuple with one field for each section of source
- Since version: 0.8

(chararray) STRSPLIT(chararray source, chararray regex, int maxsplits)
Split a chararray by a regular expression, limiting the number of sections
- Parameters:
  - source: the chararray to split
  - regex: the regular expression to use as the delimiter
  - maxsplits: the maximum number of splits
- Returns: Tuple with one field for each section of source; if there are more than maxsplits sections, only the first maxsplits sections will be in the tuple
- Since version: 0.8

chararray SUBSTRING(chararray source, int start, int end)
- Parameters:
  - source: the chararray to take the substring from
  - start: the start position (inclusive), counting from 0
  - end: the end position (exclusive), counting from 0
- Returns: Substring of source from start up to, but not including, end; error if any input value has a length shorter than start
- Since version: 0.8

{(chararray)} TOKENIZE(chararray input)
- Parameter:
  - input: the chararray to split
- Returns: input split on whitespace, with each resulting value placed in its own tuple and all tuples placed in the bag
- Since version: 0.1

chararray TRIM(chararray input)
- Parameter: input
- Returns: input, with all leading and trailing whitespace removed
- Since version: 0.8

chararray UCFIRST(chararray input)
- Parameter: input
- Returns: input, with the first character converted to uppercase
- Since version: 0.8

chararray UPPER(chararray input)
- Parameter: input
- Returns: input, with all characters converted to uppercase
- Since version: 0.8
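Several of the chararray functions above are often chained to clean up and pick apart text fields. The following is a sketch only, assuming a hypothetical file of email-like addresses, one per line:

```pig
-- 'input/contacts.txt' and the field name addr are hypothetical.
contacts = load 'input/contacts.txt' as (addr:chararray);

-- Normalize first: strip surrounding whitespace, then lowercase.
cleaned = foreach contacts generate LOWER(TRIM(addr)) as addr;

parts = foreach cleaned generate
            addr,
            SIZE(addr)                  as len,     -- number of characters
            SUBSTRING(addr, 0, 3)       as prefix,  -- start inclusive, end exclusive
            STRSPLIT(addr, '@', 2)      as pieces,  -- tuple of at most two sections
            REPLACE(addr, '@', ' at ')  as spoken;

dump parts;
```

STRSPLIT returns a tuple rather than a bag, so its sections can be referenced by position (for example, pieces.$0) in a later foreach.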
Built-in complex type UDFs
{(chararray, chararray, double)} COR({(double)} b1, {(double)} b2)
Calculate the correlation between two bags of doubles
- Parameters: b1, b2
- Returns: First chararray is the name of b1, second chararray is the name of b2, double is the correlation between b1 and b2
- Since version: 0.8

{(chararray, chararray, double)} COV({(double)} b1, {(double)} b2)
Calculate the covariance of two bags of doubles
- Parameters: b1, b2
- Returns: First chararray is the name of b1, second chararray is the name of b2, double is the covariance of b1 and b2
- Since version: 0.8

bag DIFF(bag b1, bag b2)
- Parameters: b1, b2
- Returns: All records from b1 that are not in b2, and all records from b2 that are not in b1
- Since version: 0.1

long SIZE(map input)
- Parameter: input
- Returns: Number of key-value pairs in input
- Since version: 0.2

long SIZE(tuple input)
- Parameter: input
- Returns: Number of fields in input
- Since version: 0.2

long SIZE(bag input)
- Parameter: input
- Returns: Number of tuples in input
- Since version: 0.2

bag TOBAG(...)
- Parameter: Variable
- Returns: If all inputs have the same schema, the resulting bag will have that schema, else it will have a null schema; if the parameters are tuples, all schemas must have the same field names in addition to types
- Since version: 0.8

map TOMAP(...)
- Parameter: Variable
- Returns: Input parameters are paired up and placed in a map as key/value, key/value; all keys must be chararrays; an odd number of arguments will result in an error
- Since version: 0.9

bag TOP(int numRecords, int field, bag source)
- Parameters:
  - numRecords: the number of records to return
  - field: the field to sort on
  - source: the bag to return records from
- Returns: A bag with numRecords records
- Since version: 0.8

tuple TOTUPLE(...)
- Parameter: Variable
- Returns: A tuple with all of the fields passed in as arguments
- Since version: 0.8
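To show how the complex type functions fit together, here is a minimal sketch over a hypothetical visits file. The relation and field names are assumptions, as is the choice to identify the sort field for TOP by its zero-based position.

```pig
-- 'input/visits.txt' and its fields are hypothetical.
visits = load 'input/visits.txt' as (visitor:chararray, url:chararray, secs:int);
by_visitor = group visits by visitor;

summary = foreach by_visitor generate
              group              as visitor,
              SIZE(visits)       as num_visits,  -- number of tuples in the bag
              TOP(3, 2, visits)  as longest;     -- field 2 is secs, counting from 0

-- Build a tuple and a bag out of scalar fields.
pairs = foreach visits generate
            TOTUPLE(visitor, url) as pair,
            TOBAG(visitor, url)   as both;

dump summary;
```

Both arguments to TOBAG here are chararrays, so per the description above the resulting bag keeps a schema instead of falling back to a null schema.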
Built-in filter functions
boolean IsEmpty(bag input)
- Parameter: input
- Returns: Boolean
- Since version: 0.1

boolean IsEmpty(tuple input)
- Parameter: input
- Returns: Boolean
- Since version: 0.1
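One common use of IsEmpty is to find keys that appear in one input but not another after a cogroup. The sketch below assumes two hypothetical inputs that share an id field:

```pig
-- Both input paths and all field names are hypothetical.
customers = load 'input/customers.txt' as (id:int, name:chararray);
orders    = load 'input/orders.txt'    as (id:int, total:double);

-- Each output record holds the key plus one bag per input.
grpd = cogroup customers by id, orders by id;

-- Keep only keys whose orders bag has no tuples.
no_orders = filter grpd by IsEmpty(orders);

names = foreach no_orders generate flatten(customers.name);
dump names;
```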
Miscellaneous built-in UDF
double RANDOM()
- Returns: A random double between 0 and 1
- Since version: 0.4
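Because RANDOM returns a value between 0 and 1, comparing it to a threshold gives a simple way to keep an approximate fraction of records. A minimal sketch with a hypothetical input:

```pig
-- 'input/events.txt' and its fields are hypothetical.
events = load 'input/events.txt' as (id:long, payload:chararray);

-- Keep roughly 10% of the records; the exact count varies from run to run.
sampled = filter events by RANDOM() < 0.1;

dump sampled;
```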
Piggybank is Pig’s repository of user-contributed functions. Piggybank functions are distributed as part of the Pig distribution, but they are not built in. To use them, you must register the Piggybank JAR, which you can find in your distribution at contrib/piggybank/java/piggybank.jar.
At the time of writing, there is no central website
or set of documentation for Piggybank. To find out what is in there, you
will need to browse through the code. You can see all of the included
functions by looking in your distribution under contrib/piggybank/. Piggybank does not yet
include any Python functions, but it is set up to allow users to
contribute functions in languages other than Java, so hopefully this will
change in time.
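Registering the JAR and calling a Piggybank function might look like the following sketch. The relative JAR path assumes the script runs from the root of the Pig distribution, the input file is hypothetical, and Reverse is used only as one example of a Piggybank string function; check your own distribution for what is available.

```pig
-- Adjust the path to wherever your Pig distribution is unpacked.
register 'contrib/piggybank/java/piggybank.jar';

-- Give the fully qualified class a short alias; the alias name is arbitrary.
define reverse org.apache.pig.piggybank.evaluation.string.Reverse();

-- 'input/words.txt' is a hypothetical file with one word per line.
words = load 'input/words.txt' as (word:chararray);
backward = foreach words generate reverse(word);

dump backward;
```

Unlike the built-in functions, a Piggybank function must be referenced by its full package name unless you define an alias for it as shown here.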