Index

A note on the digital index

A link in an index entry is displayed as the section title in which that entry appears. Because some sections have multiple index markers, it is not unusual for an entry to have several links to the same section. Clicking on any link will take you directly to the place in the text in which the marker appears.

Symbols

!= inequality operator, Filter
# dereference operator for maps, Map
$ macro parameter, Macros
$ parameter substitution target, Parameter Substitution
% modulo operator, Expressions in foreach
() tuple parentheses, Dump
* all fields, Expressions in foreach
* multiplication operator, Expressions in foreach
* zero or more characters glob, Load
+ addition operator, Expressions in foreach
- subtraction operator, Expressions in foreach
- unary negative operator, Expressions in foreach
-- single line comment operator, Comments
.. range of fields, Expressions in foreach
/ division operator, Expressions in foreach
/* */ multiline comment operator, Comments
< inequality operator, Filter
<= inequality operator, Filter
== equality operator, Filter
> inequality operator, Filter
>= inequality operator, Filter
? any character glob, Load
? bincond operator, Expressions in foreach
[] map brackets, Dump
\ escape character, Load
{} bag braces, Dump
{} macro operator, Macros

A

ABS function, Built-in math UDFs
accumulator interface, Accumulator Interface
ACID, NoSQL Databases
ACOS function, Built-in math UDFs
AddForEach optimization, Debugging Tips
algebraic calculations, Group, Algebraic Interface
algebraic interface, Algebraic Interface–Algebraic Interface
aliases, Preliminary Matters, define and UDFs
Amazon Elastic MapReduce (EMR), Pig’s History, Running Pig in the Cloud
Apache HBase, HBase–HBase
Apache HCatalog, Metadata in Hadoop
Apache Hive, Pig and Hive
Apache open source, What Is Pig?, Downloading the Pig Package from Apache
arithmetic operators, Expressions in foreach
as clause (load function), Load, Naming fields in foreach
as clause (stream command), stream
ASIN function, Built-in math UDFs
ATAN function, Built-in math UDFs
AVG functions, Built-in aggregate UDFs

B

bad records, handling, Bad Record Handling

bag data type, Bag, Schemas, Interacting with Pig values, Memory Issues in Eval Funcs, Python UDFs

bag DIFF function, Built-in complex type UDFs

bag projection, Expressions in foreach

bag TOBAG function, Built-in complex type UDFs

bag TOP function, Built-in complex type UDFs

BagFactory class, Interacting with Pig values

baseball examples, Code Examples in This Book, Schemas, Expressions in foreach, Registering Python UDFs, flatten, Nonlinear Data Flows

base on balls and IBBs, Schemas
batting average, Expressions in foreach
data set, Code Examples in This Book, flatten
players by position and team, Nonlinear Data Flows
slugging percentage, Registering Python UDFs

behavior prediction models, What Is Pig Useful For?

binary condition operator, Expressions in foreach

bind call, Bind

bindings, multiple, Binding Multiple Sets of Variables, Running Multiple Bindings

boolean IsEmpty functions, Built-in filter functions

Boolean operators, Filter

bottlenecks, Making Pig Fly

built-in aggregate UDFs, Built-in aggregate UDFs–Built-in aggregate UDFs

built-in chararray and bytearray UDFs, Built-in chararray and bytearray UDFs–Built-in chararray and bytearray UDFs

built-in complex type UDFs, Built-in complex type UDFs–Built-in complex type UDFs

built-in filter functions, Built-in filter functions

built-in load and store functions, Built-in Load and Store Functions

built-in math UDFs, Built-in math UDFs

bytearray CONCAT functions, Built-in chararray and bytearray UDFs

bytearray type, Scalar Types, Schemas, Choose the Right Data Type, Python UDFs, Casting bytearrays

C

cache clause (define statement), stream

caching option (HBase), HBase

Cascading, Cascading

case sensitivity, Case Sensitivity, User Defined Functions, Writing an Evaluation Function in Java

Pig Latin, Case Sensitivity
UDF names, User Defined Functions, Writing an Evaluation Function in Java

Cassandra, Apache, Cassandra

Cassandra: The Definitive Guide (Hewitt), Cassandra

caster option (HBase), HBase

casts, Casts–Casts, Getting the casting functions, Casting bytearrays

cat command, HDFS Commands in Grunt, Order by

CBRT function, Built-in math UDFs

CEIL function, Built-in math UDFs

chararray functions, Built-in aggregate UDFs, Built-in chararray and bytearray UDFs

CONCAT, Built-in chararray and bytearray UDFs
LCFIRST, Built-in chararray and bytearray UDFs
LOWER, Built-in chararray and bytearray UDFs
MAX, Built-in aggregate UDFs
MIN, Built-in aggregate UDFs
REGEX_EXTRACT, Built-in chararray and bytearray UDFs
REGEX_EXTRACT_ALL, Built-in chararray and bytearray UDFs
REPLACE, Built-in chararray and bytearray UDFs
STRSPLIT, Built-in chararray and bytearray UDFs
SUBSTRING, Built-in chararray and bytearray UDFs
TOKENIZE, Built-in chararray and bytearray UDFs
TRIM, Built-in chararray and bytearray UDFs
UCFIRST, Built-in chararray and bytearray UDFs
UPPER, Built-in chararray and bytearray UDFs

chararray type, Scalar Types, Schemas, Filter, Python UDFs

checking syntax, Syntax Highlighting and Checking

Cloud computing, Running Pig in the Cloud

Cloudera, downloading Pig from, Downloading Pig from Cloudera

cluster, Running Pig on Your Hadoop Cluster, Using Compression in Intermediate Results

running Pig on your, Running Pig on Your Hadoop Cluster
setting up LZO on your, Using Compression in Intermediate Results

cogroup operator, Parallel, cogroup, Nonlinear Data Flows, Setting the Partitioner, explain, explain, Filter Early and Often

columnMapKeyPrune optimization, Debugging Tips

combiner phase, Group, Algebraic Interface, Combiner Phase

combiner, turning off, Debugging Tips

command tab completion, Grunt

command-line options, Command-Line and Configuration Options

comment operators (Pig Latin), Comments

compile method, Compile

complex data types, Complex Types–Nulls, Evaluation Function Basics, Input and Output Schemas, Built-in Evaluation and Filter Functions, Built-in complex type UDFs

compression, using in intermediate results, Using Compression in Intermediate Results

CONCAT functions, Built-in chararray and bytearray UDFs

constructors, Constructors and Passing Data from Frontend to Backend–UDFContext

controlling execution, Controlling Execution

copyFromLocal command, HDFS Commands in Grunt

copyToLocal command, HDFS Commands in Grunt

COR function, Built-in complex type UDFs

corrupted data, handling, Bad Record Handling

COS function, Built-in math UDFs

COSH function, Built-in math UDFs

COUNT function, Evaluation Function Basics, Algebraic Interface, Algebraic Interface, Accumulator Interface, Built-in aggregate UDFs

COUNT_STAR function, Built-in aggregate UDFs

COV function, Built-in complex type UDFs

cross operator, Parallel, cross–cross, Nonlinear Data Flows, Setting the Partitioner, Filter Early and Often

D

-D passing properties, Command-Line and Configuration Options

DAG (directed acyclic graph), Pig Latin, a Parallel Dataflow Language, Nonlinear Data Flows

data, What Is Pig Useful For?, Types–Nulls, Debugging Tips, Choose the Right Data Type, Data Layout Optimization, Constructors and Passing Data from Frontend to Backend, Writing Data–Writing records, Pig and Hive, Metadata in Hadoop

layout optimization, Data Layout Optimization
passing, Constructors and Passing Data from Frontend to Backend
pipelines, What Is Pig Useful For?, Debugging Tips, Pig and Hive, Metadata in Hadoop
types, Types–Nulls, Choose the Right Data Type
writing, Writing Data–Writing records

data sets, example, Code Examples in This Book

dataflow languages, Pig Latin, a Parallel Dataflow Language, Embedding Pig Latin in Python

DataNodes, Loading the distributed cache, Distributed Cache, Hadoop Distributed File System

debugging, Debugging Tips

%declare, Parameter Substitution

declaring, Schemas, Nonlinear Data Flows, Macros, Choose the Right Data Type, Input and Output Schemas, Constructors and Passing Data from Frontend to Backend

a filename, Constructors and Passing Data from Frontend to Backend
a macro, Macros
a schema, Schemas, Input and Output Schemas
a type, Nonlinear Data Flows, Choose the Right Data Type

%default, Parameter Substitution

define statement, Registering UDFs, define and UDFs, stream, Macros, Constructors and Passing Data from Frontend to Backend

define utility method, Utility Methods

describe operator, describe

development tools, Development Tools–Debugging Tips

DeWitt, David J., Joining skewed data

DIFF function, Built-in complex type UDFs

directed acyclic graph (DAG), Pig Latin, a Parallel Dataflow Language, Nonlinear Data Flows

distinct operator, Distinct, Parallel, Nested foreach, Nested foreach, Setting the Partitioner, Filter Early and Often

distributed cache, Joining small to large data, stream, Loading the distributed cache, Distributed Cache

distributive calculations, Group, Algebraic Interface

double functions, Built-in math UDFs, Built-in aggregate UDFs, Miscellaneous built-in UDF

ABS, Built-in math UDFs
ACOS, Built-in math UDFs
ASIN, Built-in math UDFs
ATAN, Built-in math UDFs
AVG, Built-in aggregate UDFs
CBRT, Built-in math UDFs
CEIL, Built-in math UDFs
COS, Built-in math UDFs
COSH, Built-in math UDFs
EXP, Built-in math UDFs
FLOOR, Built-in math UDFs
LOG, Built-in math UDFs
LOG10, Built-in math UDFs
MAX, Built-in aggregate UDFs, Built-in aggregate UDFs
MIN, Built-in aggregate UDFs
RANDOM, Miscellaneous built-in UDF
SIN, Built-in math UDFs
SINH, Built-in math UDFs
SQRT, Built-in math UDFs
SUM, Built-in aggregate UDFs
TAN, Built-in math UDFs
TANH, Built-in math UDFs

double type, Scalar Types, Schemas, Python UDFs

-dryrun command line option, Macros, Syntax Highlighting and Checking

dump statement, Dump

E

Eclipse syntax highlighting, Syntax Highlighting and Checking

Elastic MapReduce (EMR), Running Pig in the Cloud

Emacs syntax highlighting, Syntax Highlighting and Checking

embedding Pig Latin in Python, Embedding Pig Latin in Python–Utility Methods

EMR (Elastic MapReduce), Amazon, Running Pig in the Cloud

equality operators, Filter

errors, How Pig differs from MapReduce, Entering Pig Latin Scripts in Grunt, Schemas, Schemas, Order by, union, explain, Run, Input and Output Schemas, Error Handling and Progress Reporting, Reading records, Failure Cleanup, Handling Failure

checking in Grunt, Entering Pig Latin Scripts in Grunt
debugging with explain, explain
in evaluation functions, Error Handling and Progress Reporting
failure cleanup, Failure Cleanup, Handling Failure
getErrorMessage function, Run
parse, Reading records
in Pig Latin scripts, How Pig differs from MapReduce
runtime exceptions, Input and Output Schemas
schema, Schemas, Schemas, union
sorting by maps, tuples, bags, Order by

escape characters (Unix shell command line), Load

ETL (extract transform load) data pipelines, What Is Pig Useful For?

evaluation functions, UDFs in foreach, Writing an Evaluation Function in Java, Where Your UDF Will Run, Evaluation Function Basics, Input and Output Schemas–Input and Output Schemas, Error Handling and Progress Reporting, Memory Issues in Eval Funcs, Built-in Evaluation and Filter Functions–Miscellaneous built-in UDF

basics, UDFs in foreach, Evaluation Function Basics
built-in, Built-in Evaluation and Filter Functions–Miscellaneous built-in UDF
error handling and progress reporting, Error Handling and Progress Reporting
input and output schemas, Input and Output Schemas–Input and Output Schemas
memory issues in, Memory Issues in Eval Funcs
where your UDF will run, Where Your UDF Will Run
writing in Java, Writing an Evaluation Function in Java

examples, Code Examples in This Book, MapReduce’s hello world, MapReduce’s hello world, MapReduce’s hello world, MapReduce’s hello world, Comparing query and dataflow languages, How Pig differs from MapReduce, Running Pig Locally on Your Machine, Running Pig on Your Hadoop Cluster, Expressions in foreach, Joining small to large data, Joining skewed data, cross, cross, stream–mapreduce, stream–mapreduce, Embedding Pig Latin in Python–Utility Methods, Constructors and Passing Data from Frontend to Backend–Loading the distributed cache, Writing Load and Store Functions, Writing Load and Store Functions, Store Functions–Store Functions and UDFContext, Storing Metadata, HBase

(see also baseball examples)
(see also NYSE examples)
blacklisting URLs, stream–mapreduce
calculating page rank from web crawl, Code Examples in This Book, stream–mapreduce, Embedding Pig Latin in Python–Utility Methods
determining metropolitan area, cross
finding the top five URLs, How Pig differs from MapReduce
group then join in SQL and Pig Latin, Comparing query and dataflow languages
HBase table, HBase
“hello world”, MapReduce’s hello world
JsonLoader, Writing Load and Store Functions
JsonStorage, Writing Load and Store Functions
MetroResolver, Constructors and Passing Data from Frontend to Backend–Loading the distributed cache
running Pig in local mode, Running Pig Locally on Your Machine
running Pig on your cluster, Running Pig on Your Hadoop Cluster
store function, Store Functions–Store Functions and UDFContext, Storing Metadata
user distribution by city, Joining skewed data, cross
word count, MapReduce’s hello world
ZIP code lookup, Joining small to large data

exec command, Controlling Pig from Grunt

-execute (-e) command-line option, Command-Line and Configuration Options

EXP function, Built-in math UDFs

explain operator, explain–explain

explicit splits, Nonlinear Data Flows

F

failure cleanup, Failure Cleanup, Handling Failure

fields, Preliminary Matters

FileOutputFormat, Setting the output location

filesystem operations, Utility Methods

filter functions, Filter, define and UDFs, Writing Evaluation and Filter Functions, Writing Filter Functions, Built-in filter functions

filter operator, How Pig differs from MapReduce, Filter–Filter, Nested foreach, Writing Evaluation and Filter Functions, Writing Filter Functions, Using partitions, Metadata in Hadoop

filters, Debugging Tips, Debugging Tips, Debugging Tips, Filter Early and Often

MergeFilter optimization, Debugging Tips
pushing, Filter Early and Often
PushUpFilter optimization, Debugging Tips
SplitFilter optimization, Debugging Tips

Finding the Top Five URLs example, How Pig differs from MapReduce

flatten statement, flatten–flatten

float functions, Built-in aggregate UDFs

AVG, Built-in aggregate UDFs
MAX, Built-in aggregate UDFs
MIN, Built-in aggregate UDFs

float type, Scalar Types, Schemas, Python UDFs

FLOOR function, Built-in math UDFs

foreach operator, foreach, UDFs in foreach, Advanced Features of foreach–Nested foreach, explain, Filter Early and Often

fragment-replicate join, Joining small to large data

frontend planning functions, Frontend Planning Functions–Passing Information from the Frontend to the Backend, Store Function Frontend Planning–Store Functions and UDFContext

frontend/backend invocation, Constructors and Passing Data from Frontend to Backend–UDFContext

fs keyword, HDFS Commands in Grunt

fuzzy joins, cross

G

gateway machine, Running Pig on Your Hadoop Cluster

Gaussian distribution, Group

getAllErrorMessages method, Run

getBytesWritten method, Run

getDuration method, Run

getErrorMessage method, Run

getNumberBytes method, Run

getNumberJobs method, Run

getNumberRecords method, Run

getOutputFormat method, Determining OutputFormat

getOutputLocations, getOutputNames methods, Run

getRecordWritten method, Run

getReturnCode method, Run

getUDFContext method, UDFContext

Global Rearrange operator, explain

globs, Load

GNU Public License (GPL) for LZO, Using Compression in Intermediate Results

group by clause, Group–Group

group by operator, How Pig differs from MapReduce

group operator, Group–Group, Parallel, Nonlinear Data Flows, Setting the Partitioner, Filter Early and Often, Evaluation Function Basics

“Group then join in SQL and Pig Latin” example, Comparing query and dataflow languages

Grunt, Grunt, Entering Pig Latin Scripts in Grunt, HDFS Commands in Grunt, Controlling Pig from Grunt, explain

controlling Pig from, Controlling Pig from Grunt
entering Pig Latin scripts in, Entering Pig Latin Scripts in Grunt
explain Pig Latin script in, explain
HDFS commands in, HDFS Commands in Grunt

gt option (HBase), HBase

gte option (HBase), HBase

gzip compression type, Using Compression in Intermediate Results

H

-h properties command-line option, Command-Line and Configuration Options

Hadoop, Pig on Hadoop, Running Pig on Your Hadoop Cluster, Command-Line and Configuration Options, HDFS Commands in Grunt, HDFS Commands in Grunt, Tune Pig and Hadoop for Your Job, Using Compression in Intermediate Results, Constructors and Passing Data from Frontend to Backend–Loading the distributed cache, Writing Load and Store Functions–Determining the location, Metadata in Hadoop, Overview of Hadoop–Hadoop Distributed File System, Hadoop Distributed File System

fs shell commands, HDFS Commands in Grunt
HDFS (Hadoop Distributed File System), Pig on Hadoop, HDFS Commands in Grunt, Constructors and Passing Data from Frontend to Backend–Loading the distributed cache, Writing Load and Store Functions–Determining the location, Hadoop Distributed File System
Java properties used, Command-Line and Configuration Options
metadata in, Metadata in Hadoop
overview, Overview of Hadoop–Hadoop Distributed File System
running Pig on your cluster, Running Pig on Your Hadoop Cluster
tarball, Using Compression in Intermediate Results
tuning, Tune Pig and Hadoop for Your Job

hadoop-site.xml file, Running Pig on Your Hadoop Cluster

Hadoop: The Definitive Guide (White), Tune Pig and Hadoop for Your Job, Overview of Hadoop

handling failure, Handling Failure

hashCode function, Shuffle Phase

HashPartitioner, Shuffle Phase

HBase, Apache, HBase–HBase

HBaseStorage function, Getting the casting functions, HBase–HBase, Built-in Load and Store Functions, Built-in Load and Store Functions

HCatalog, Apache, Metadata in Hadoop

HCatLoader, Using partitions, Pushing down projections

heap size, Joining skewed data, Tune Pig and Hadoop for Your Job, Memory Issues in Eval Funcs

hello world example, MapReduce’s hello world

-help (-h) command-line option, Command-Line and Configuration Options

Hewitt, Eben, Cassandra

highlighting syntax, Syntax Highlighting and Checking

Hive, Apache, Pig and Hive

I

illustrate operator, illustrate

implicit splits, Nonlinear Data Flows

import command, Including Other Pig Latin Scripts

including other Pig Latin scripts, Including Other Pig Latin Scripts

INDEXOF function, Built-in chararray and bytearray UDFs

inner joins, Join, Joining sorted data

input clause (define command), stream

input schemas, Input and Output Schemas

input size, Making Pig Fly

InputFormat, determining, Determining InputFormat

int AVG function, Built-in aggregate UDFs

int functions, Built-in aggregate UDFs, Built-in chararray and bytearray UDFs

INDEXOF, Built-in chararray and bytearray UDFs
LAST_INDEX_OF, Built-in chararray and bytearray UDFs
MAX, Built-in aggregate UDFs
MIN, Built-in aggregate UDFs

int type, Scalar Types, Schemas, Python UDFs

intermediate results size, Making Pig Fly

invoker methods, Calling Static Java Functions

isSuccessful method, Run

iterative processing, What Is Pig Useful For?, Embedding Pig Latin in Python, Binding Multiple Sets of Variables

J

Jackson JSON library, Writing Load and Store Functions

JAR files, Downloading Pig Artifacts from Maven, Registering UDFs, Registering Python UDFs, Testing Your Scripts with PigUnit, Utility Methods, Python UDFs, Writing Load and Store Functions, Piggybank

downloading, Downloading Pig Artifacts from Maven
Jackson, Writing Load and Store Functions
Jython, Registering Python UDFs
Piggybank, Registering UDFs, Piggybank
pigunit, Testing Your Scripts with PigUnit
registering, Utility Methods, Python UDFs

Java, Pig Philosophy, Downloading the Pig Package from Apache, Downloading the Pig Package from Apache, Command-Line and Configuration Options, Scalar Types–Nulls, Bag, Filter, User Defined Functions, define and UDFs, Calling Static Java Functions, Calling Static Java Functions, Joining small to large data, mapreduce, set, Setting the Partitioner, Testing Your Scripts with PigUnit, Embedding Pig Latin in Python, Writing an Evaluation Function in Java–Memory Issues in Eval Funcs, Interacting with Pig values, Input and Output Schemas, Input and Output Schemas, Input and Output Schemas, Input and Output Schemas, Loading the distributed cache, Overloading UDFs, Python UDFs, Casting bytearrays, Store Functions, Cascading, HBase, Built-in Evaluation and Filter Functions, Map Phase

and Cascading data flows, Cascading
casting and HBase, HBase
compared with Python, Python UDFs
data types used by Pig, Scalar Types–Nulls, Input and Output Schemas
embedding interface, Embedding Pig Latin in Python
evaluation functions in, Writing an Evaluation Function in Java–Memory Issues in Eval Funcs, Built-in Evaluation and Filter Functions
integration with Pig, Pig Philosophy, Downloading the Pig Package from Apache
Iterable, Interacting with Pig values
JUnit, Testing Your Scripts with PigUnit
and MapReduce, Map Phase
memory requirements of, Bag, Joining small to large data
multiple inheritance workaround, Casting bytearrays, Store Functions
passing arguments to, mapreduce
properties used by Pig and Hadoop, Command-Line and Configuration Options, set
reflection, Calling Static Java Functions, Input and Output Schemas, Input and Output Schemas
regular expressions, Filter
setting JAVA_HOME, Downloading the Pig Package from Apache
setting the Partitioner, Setting the Partitioner
static functions, Calling Static Java Functions
UDFs and, User Defined Functions, define and UDFs, Input and Output Schemas, Loading the distributed cache, Overloading UDFs

JobTracker, Running Pig on Your Hadoop Cluster, MapReduce Job Status, Error Handling and Progress Reporting, MapReduce

join operator, Parallel

joining small to large data, Joining small to large data, Distributed Cache

joining sorted data, Joining sorted data

joins, Comparing query and dataflow languages, How Pig differs from MapReduce, What Is Pig Useful For?, Join–Join, Join, Join, Parallel, Using Different Join Implementations–cross, Joining small to large data, Joining sorted data, Joining sorted data, Nonlinear Data Flows, Setting the Partitioner, illustrate, Filter Early and Often, Set Up Your Joins Properly, Determining the location

default behavior, Join–Join
and filter pushing, Filter Early and Often
how to update every five minutes, What Is Pig Useful For?
inner, Join, Joining sorted data
input path overwritten, Determining the location
no multiquery for, Nonlinear Data Flows
other implementations, Using Different Join Implementations–cross, Set Up Your Joins Properly
outer, Join, Joining small to large data
parallel clause and, Parallel
partition clause and, Setting the Partitioner
in Pig Latin versus MapReduce, How Pig differs from MapReduce
in Pig Latin versus SQL, Comparing query and dataflow languages
and sample records, illustrate
sort-merge, Joining sorted data

JSON, Schemas, Interacting with Pig values, Writing Load and Store Functions–Loading metadata, Determining OutputFormat–Storing Metadata

JsonLoader example, Interacting with Pig values, Writing Load and Store Functions–Loading metadata
JsonStorage example, Determining OutputFormat–Storing Metadata

JUnit, Testing Your Scripts with PigUnit

Jython, User Defined Functions, Registering Python UDFs, Python UDFs

K

keys, Pig on Hadoop, How Pig differs from MapReduce
kill command, Controlling Pig from Grunt

L

LAST_INDEX_OF function, Built-in chararray and bytearray UDFs

LCFIRST function, Built-in chararray and bytearray UDFs

Le Dem, Julien, Embedding Pig Latin in Python

licensing, What Is Pig?, Using Compression in Intermediate Results

limit operator, Limit, Parallel, Nested foreach

limit option (HBase), HBase

LimitOptimizer optimization, Debugging Tips

linear data flows, Nonlinear Data Flows

load clause (mapreduce statement), mapreduce

load function (PigStorage), Choose the Right Data Type

load functions (Pig), Load Functions–Pushing down projections, Frontend Planning Functions–Passing Information from the Frontend to the Backend, Passing Information from the Frontend to the Backend, Backend Data Reading–Reading records, Additional Load Function Interfaces–Pushing down projections, Loading metadata, Built-in Load and Store Functions

additional interfaces, Additional Load Function Interfaces–Pushing down projections
backend data reading, Backend Data Reading–Reading records
built-in, Built-in Load and Store Functions
frontend planning functions, Frontend Planning Functions–Passing Information from the Frontend to the Backend
loading metadata, Loading metadata
passing info frontend to backend, Passing Information from the Frontend to the Backend

load operator, Load, explain, Filter Early and Often

loadKey option (HBase), HBase

local mode, Running Pig Locally on Your Machine

Local Rearrange operator, explain

LOG function, Built-in math UDFs

LOG10 function, Built-in math UDFs

logical optimizer, Debugging Tips

logical plan, explain, Debugging Tips

LogicalExpressionsSimplifier optimization, Debugging Tips

logs, MapReduce Job Status, Error Handling and Progress Reporting

long AVG function, Built-in aggregate UDFs

long functions, Built-in math UDFs, Built-in aggregate UDFs, Built-in chararray and bytearray UDFs, Built-in complex type UDFs

COUNT, Built-in aggregate UDFs
COUNT_STAR, Built-in aggregate UDFs
MAX, Built-in aggregate UDFs
MIN, Built-in aggregate UDFs
ROUND, Built-in math UDFs
SIZE, Built-in chararray and bytearray UDFs, Built-in complex type UDFs
SUM, Built-in aggregate UDFs

long type, Scalar Types, Schemas, Python UDFs

lookup table, constructing, Constructors and Passing Data from Frontend to Backend

LOWER function, Built-in chararray and bytearray UDFs

lt option (HBase), HBase

lte option (HBase), HBase

LZO compression type, Using Compression in Intermediate Results

M

macros, Macros

map data type, Map, Schemas, Python UDFs

map only jobs, Reduce Phase

map parallelism, Parallel

map phase, Pig on Hadoop, Map Phase

map projection operator (#), Expressions in foreach

map TOMAP function, Built-in complex type UDFs

MapReduce, Pig on Hadoop, How Pig differs from MapReduce–How Pig differs from MapReduce, mapreduce, MapReduce Job Status, Tune Pig and Hadoop for Your Job, MapReduce

how Pig differs from, How Pig differs from MapReduce–How Pig differs from MapReduce
integrating with Pig, mapreduce
job status, MapReduce Job Status
performance tuning properties, Tune Pig and Hadoop for Your Job

mapreduce operator, mapreduce, Filter Early and Often

“Mary had a Little Lamb” example, MapReduce’s hello world

Maven, downloading Pig from, Downloading Pig Artifacts from Maven

MAX functions, Built-in aggregate UDFs

memory, Bag, Making Pig Fly, Tune Pig and Hadoop for Your Job

buffer size, Tune Pig and Hadoop for Your Job
requirements for Pig data types, Bag
size, Making Pig Fly

merge join, Joining sorted data, Set Up Your Joins Properly

MergeFilter optimization, Debugging Tips

MergeForEach optimization, Debugging Tips

metadata, Loading metadata, Storing Metadata, Metadata in Hadoop

in Hadoop, Metadata in Hadoop
loading, Loading metadata
storing, Storing Metadata

metropolitan name example, Constructors and Passing Data from Frontend to Backend–Loading the distributed cache

MIN functions, Overloading UDFs, Built-in aggregate UDFs

multiple bindings, running, Running Multiple Bindings

multiple joins, Join

multiple keys, grouping on, Group

multiquery, Nonlinear Data Flows, Use Multiquery When Possible

multiway joins, Joining skewed data

N

NameNode, Running Pig on Your Hadoop Cluster, Joining small to large data, Data Layout Optimization, Loading the distributed cache, Distributed Cache, Hadoop Distributed File System

namespaces, Registering Python UDFs

nested foreach, Nested foreach–Nested foreach

noise words, Join

nonlinear data flows, Nonlinear Data Flows–Nonlinear Data Flows

NoSQL databases, NoSQL Databases

null, Nulls, Expressions in foreach, Filter, Join, Error Handling and Progress Reporting

NYSE examples, Code Examples in This Book, Running Pig Locally on Your Machine, Casts, Distinct, Join, Nested foreach, Nested foreach, Nested foreach, Joining sorted data, stream, Macros, UDFContext

average dividends, Running Pig Locally on Your Machine
buy/sell analyzer, UDFContext
daily sorted dividends, Joining sorted data
data set, Code Examples in This Book
dividends increased between two dates, Join
filter out low-dividend stocks, stream
find list of ticker symbols, Distinct
number of unique stock symbols, Nested foreach
stock-price changes on dividend days, Macros
top three dividends, Nested foreach
total trade estimate, Casts
tracking a stock over time, Nested foreach

O

Olston, Christopher, Pig’s History
optimizations, turning off, Debugging Tips, Debugging Tips
optimizing scripts, Making Pig Fly–Bad Record Handling
order by operator, How Pig differs from MapReduce, Order by
order operator, Order by, Order by, Parallel, Nested foreach, Setting the Partitioner
outer joins, Join, Joining small to large data
output clause (define command), stream
output location, Setting the output location
output phase, Output Phase
output schemas, Input and Output Schemas
output size, Making Pig Fly
OutputFormat, Store Functions, Output Phase
overloading, Calling Static Java Functions, Overloading UDFs

P

Package operator, explain

page rank, calculating from web crawl, Embedding Pig Latin in Python–Utility Methods

parallel clause, Parallel

parallel dataflow language, Pig Latin, a Parallel Dataflow Language

parallelism, Select the Right Level of Parallelism, Where Your UDF Will Run, Writing Load and Store Functions

parameter substitution, Parameter Substitution–Parameter Substitution

partition clause, Setting the Partitioner

Partitioner class, Setting the Partitioner, Shuffle Phase

partitions, using, Using partitions

performance tuning properties (MapReduce), Tune Pig and Hadoop for Your Job

philosophy of Pig, Pig Philosophy

physical plan, explain

Pig, Pig Philosophy, Pig’s History, Downloading and Installing Pig–Downloading the Source, Downloading the Pig Package from Apache, Downloading the Pig Package from Apache, Downloading the Source, Downloading the Source, Running Pig–Command-Line and Configuration Options, Casts, Integrating Pig with Legacy Code and MapReduce–mapreduce, Tune Pig and Hadoop for Your Job, Utility Methods, Python UDFs

downloading and installing, Downloading and Installing Pig–Downloading the Source
fs method, Utility Methods
history, Pig’s History
integrating with legacy code and MapReduce, Integrating Pig with Legacy Code and MapReduce–mapreduce
issue-tracking system, Downloading the Source
performance tuning, Tune Pig and Hadoop for Your Job
philosophy, Pig Philosophy
portability, Downloading the Pig Package from Apache
release page, Downloading the Pig Package from Apache
running, Running Pig–Command-Line and Configuration Options
strength of typing, Casts
translation to Python types, Python UDFs
version control page, Downloading the Source

“Pig counts Mary and her lamb” example, MapReduce’s hello world

Pig Latin, What Is Pig?, What Is Pig Useful For?, Preliminary Matters, Preliminary Matters, Case Sensitivity, Comments, Input and Output–Dump, Relational Operations–Parallel, Pig Latin Preprocessor–Including Other Pig Latin Scripts, Developing and Testing Pig Latin Scripts–Testing Your Scripts with PigUnit, Syntax Highlighting and Checking, Embedding Pig Latin in Python–Utility Methods

best use cases for, What Is Pig Useful For?
case sensitivity, Case Sensitivity
comment operators, Comments
developing and testing scripts, Developing and Testing Pig Latin Scripts–Testing Your Scripts with PigUnit
embedding in Python, Embedding Pig Latin in Python–Utility Methods
fields, Preliminary Matters
input and output, Input and Output–Dump
preprocessor, Pig Latin Preprocessor–Including Other Pig Latin Scripts
relational operations, Relational Operations–Parallel
relations, Preliminary Matters
syntax highlighting packages, Syntax Highlighting and Checking

“Pig Latin: A Not-So-Foreign Language for Data Processing” (Olston), Pig’s History

Piggybank, User Defined Functions, Piggybank

PigStats methods, Run

PigStorage function, Store, Getting the casting functions, Built-in Load and Store Functions, Built-in Load and Store Functions

PigUnit, Testing Your Scripts with PigUnit–Testing Your Scripts with PigUnit

pipelines, data, What Is Pig Useful For?, Debugging Tips, Pig and Hive, Metadata in Hadoop

POSIX, Pig on Hadoop, Hadoop Distributed File System

power law distribution, Group

“Practical Skew Handling in Parallel Joins” (DeWitt et al.), Joining skewed data

prepareToRead, Getting ready to read

prepareToWrite method, Preparing to write

prereduce merge, Combiner Phase

projections, pushing down, Pushing down projections

-propertyFile (-P) command-line option, Command-Line and Configuration Options

PushDownForeachFlatten feature, Debugging Tips

PushUpFilter optimization, Debugging Tips

Pygmalion project, Cassandra

Python, User Defined Functions, Registering Python UDFs, Embedding Pig Latin in Python–Utility Methods, Python UDFs–Python UDFs

embedding Pig Latin in, Embedding Pig Latin in Python–Utility Methods
UDFs, User Defined Functions, Registering Python UDFs, Python UDFs–Python UDFs

Q

query languages, Comparing query and dataflow languages

R

RANDOM functions, Miscellaneous built-in UDF
raw data, What Is Pig Useful For?, Pig and Hive
RDBMS versus Hadoop environments, Comparing query and dataflow languages, Using Different Join Implementations
RecordWriter class, Preparing to write, Output Phase
reduce phase, Pig on Hadoop, Reduce Phase
reducers, How Pig differs from MapReduce, Group, Order by, Joining skewed data, Select the Right Level of Parallelism, Combiner Phase
reflection, Calling Static Java Functions, Input and Output Schemas, Input and Output Schemas
REGEX_EXTRACT function, Built-in chararray and bytearray UDFs
REGEX_EXTRACT_ALL function, Built-in chararray and bytearray UDFs
register command, Registering UDFs
registerJar utility method, Utility Methods
registerUDF utility method, Utility Methods
regular expressions, Filter
relational operations, Relational Operations–Parallel, Advanced Features of foreach–cross
relations, Preliminary Matters
REPLACE function, Built-in chararray and bytearray UDFs
result method, Run
return codes, Return Codes, Run
returns clause (define statement), Macros
rmr command, HDFS Commands in Grunt
ROUND function, Built-in math UDFs
run command, Controlling Pig from Grunt
running multiple bindings, Running Multiple Bindings
“Running Pig in Local Mode” example, Running Pig Locally on Your Machine
“Running Pig On Your Cluster” example, Running Pig on Your Hadoop Cluster
runSingle command, Run
runtime declaration (schemas), Schemas
runtime exceptions, Input and Output Schemas

S

sampling, Sample, illustrate

illustrate tool, illustrate
sample operator, Sample

scalar types, Scalar Types

schemas, Schemas–Casts, Input and Output Schemas–Input and Output Schemas, Python UDFs, Loading metadata, Checking the schema

scripts, Testing Your Scripts with PigUnit–Testing Your Scripts with PigUnit, Making Pig Fly–Bad Record Handling

optimizing, Making Pig Fly–Bad Record Handling
testing with PigUnit, Testing Your Scripts with PigUnit–Testing Your Scripts with PigUnit

self joins, Join

semi-join, cogroup

set command, set

set utility method, Utility Methods

setLocation, Determining the location

setOutputPath utility function, Setting the output location

setStoreLocation function, Setting the output location

setting the Partitioner, Setting the Partitioner

ship clause, stream

shuffle phase, Pig on Hadoop, Shuffle Phase

shuffle size, Making Pig Fly

SIN function, Built-in math UDFs

SINH function, Built-in math UDFs

SIZE functions, Built-in chararray and bytearray UDFs, Built-in complex type UDFs

skew joins, Joining skewed data, Setting the Partitioner, Set Up Your Joins Properly, Tune Pig and Hadoop for Your Job

skew, handling of, How Pig differs from MapReduce, Group, Group, Order by, Joining skewed data, Setting the Partitioner, Set Up Your Joins Properly, Select the Right Level of Parallelism, Tune Pig and Hadoop for Your Job, Algebraic Interface, Combiner Phase

Hadoop combiner, Group, Algebraic Interface, Combiner Phase
order by operator, Order by
skew joins, Joining skewed data, Setting the Partitioner, Set Up Your Joins Properly, Tune Pig and Hadoop for Your Job

sort command, Filter Early and Often

sort-merge join, Joining sorted data

source code, Downloading the Source

speculative execution, Select the Right Level of Parallelism, Handling Failure

spill files, number of, Tune Pig and Hadoop for Your Job

spilling to disk, Memory Issues in Eval Funcs

split operator, Nonlinear Data Flows, Filter Early and Often

SplitCombination optimization, Debugging Tips

SplitFilter optimization, Debugging Tips

SQL compared/contrasted with Pig, Comparing query and dataflow languages–Comparing query and dataflow languages, Tuple, Bag, Filter, Filter, Group, Distinct, Join, Join, Using Different Join Implementations, union, Pig and Hive, Built-in aggregate UDFs

Apache Hive, Pig and Hive
constraints on data, Bag
dataflow and query languages, Comparing query and dataflow languages–Comparing query and dataflow languages
group operator, Group
long COUNT, Built-in aggregate UDFs
noise words, Join
nulls, Filter, Join
optimizers, Using Different Join Implementations
trinary logic, Filter
tuples, Tuple
union, union
use of distinct statement, Distinct

SQL layer (Apache Hive), Pig and Hive

SQRT function, Built-in math UDFs

static Java functions, Calling Static Java Functions

statistics summary, Pig Statistics

stats command, Pig Statistics

stock analyzer example, UDFContext

store clause (mapreduce statement), mapreduce

store functions, Writing Load and Store Functions, Store Functions–Storing Metadata, Built-in Load and Store Functions

built-in, Built-in Load and Store Functions
writing, Writing Load and Store Functions, Store Functions–Storing Metadata

store operator, Store, explain, Filter Early and Often

StoreFunc class, Store Functions

storing metadata, Storing Metadata

stream operator, stream, Filter Early and Often

streams, number of, Tune Pig and Hadoop for Your Job

STRSPLIT functions, Built-in chararray and bytearray UDFs

subqueries, Pig alternative to, Comparing query and dataflow languages

SUBSTRING functions, Built-in chararray and bytearray UDFs

SUM functions, Algebraic Interface, Built-in aggregate UDFs, Built-in aggregate UDFs

svn version control, Downloading the Source

syntax highlighting and checking, Syntax Highlighting and Checking

synthetic join, cross

T

tab delimited files, Choose the Right Data Type
TAN function, Built-in math UDFs
TANH function, Built-in math UDFs
tarball, Hadoop, Downloading the Pig Package from Apache, Using Compression in Intermediate Results
TaskTracker, MapReduce, Hadoop Distributed File System
testing scripts with PigUnit, Testing Your Scripts with PigUnit–Testing Your Scripts with PigUnit
TextLoader function, Built-in Load and Store Functions
TextMate syntax highlighting, Syntax Highlighting and Checking
theta joins, cross
threshold usage, Tune Pig and Hadoop for Your Job
TOBAG function, Built-in complex type UDFs
TOKENIZE function, Built-in chararray and bytearray UDFs
TOMAP function, Built-in complex type UDFs
TOP function, Built-in complex type UDFs
TOTUPLE function, Built-in complex type UDFs
TRIM function, Built-in chararray and bytearray UDFs
trinary logic, Filter
tuning Pig and Hadoop, Tune Pig and Hadoop for Your Job
tuple data type, Tuple, Schemas, Interacting with Pig values, Python UDFs
tuple projection operator (.), Expressions in foreach
tuple TOTUPLE function, Built-in complex type UDFs
TupleFactory class, Interacting with Pig values
Turing Complete Pig, Embedding Pig Latin in Python
turning off features, Debugging Tips
typechecking, Input and Output Schemas, Overloading UDFs
types, data, Types–Nulls, Python UDFs

U

UCFIRST function, Built-in chararray and bytearray UDFs

UDFContext class, UDFContext, Store Functions and UDFContext

UDFs (User Defined Functions), Code Examples in This Book, UDFs in foreach, User Defined Functions, Registering UDFs–Registering Python UDFs, define and UDFs, Writing Your UDF to Perform, Writing an Evaluation Function in Java, Where Your UDF Will Run, Error Handling and Progress Reporting, Overloading UDFs, Built-in UDFs–Miscellaneous built-in UDF

built-in, Built-in UDFs–Miscellaneous built-in UDF
define and, define and UDFs
error handling, Error Handling and Progress Reporting
in foreach, UDFs in foreach
naming, Writing an Evaluation Function in Java
optimizing, Writing Your UDF to Perform
overloading, Overloading UDFs
registering, Registering UDFs–Registering Python UDFs
where your UDF will run, Where Your UDF Will Run

union operator, How Pig differs from MapReduce, union, Nonlinear Data Flows, Filter Early and Often, Determining the location

UPPER function, Built-in chararray and bytearray UDFs

User Defined Functions, UDFs in foreach (see UDFs)

using clause (load function), Load

using clause (store function), Store

Utf8StorageConverter, Casting bytearrays

utility methods, Utility Methods

V

variables, binding multiple sets of, Binding Multiple Sets of Variables

-version command-line option, Command-Line and Configuration Options

version control with git, Downloading the Source

version differences in Hadoop, Running Pig on Your Hadoop Cluster, Load

file locations, Running Pig on Your Hadoop Cluster
globs, Load

version differences in Pig, Downloading the Pig Package from Apache, Running Pig Locally on Your Machine, Running Pig on Your Hadoop Cluster, Command-Line and Configuration Options, HDFS Commands in Grunt, HDFS Commands in Grunt, Map, Schemas, Schemas, Dump, Expressions in foreach, Parallel, User Defined Functions, User Defined Functions, Registering UDFs, Registering UDFs, Registering Python UDFs, Calling Static Java Functions, flatten, Joining skewed data, Joining sorted data, cross, mapreduce, Setting the Partitioner, Pig Latin Preprocessor, Macros, Including Other Pig Latin Scripts, illustrate, Pig Statistics, Debugging Tips, Testing Your Scripts with PigUnit, Project Early and Often, Data Layout Optimization, Embedding Pig Latin in Python, Writing Evaluation and Filter Functions, Writing Evaluation and Filter Functions, Input and Output Schemas, Loading the distributed cache, UDFContext, Python UDFs, Writing Load and Store Functions, Casting bytearrays, HBase, Built-in Evaluation and Filter Functions–Miscellaneous built-in UDF

.. field range, Expressions in foreach
built-in eval and filter functions, Built-in Evaluation and Filter Functions–Miscellaneous built-in UDF
bytesToMap methods, Casting bytearrays
column families, HBase
data layout optimization, Data Layout Optimization
dependencies inside Python scripts, Registering Python UDFs
dump output, Dump
EvalFunc, Loading the distributed cache
flatten schema bug, flatten
globs accepted by register, Registering UDFs
Grunt command sh, HDFS Commands in Grunt
hadoop fs shell commands, Running Pig on Your Hadoop Cluster, HDFS Commands in Grunt
Hadoop requirements, Downloading the Pig Package from Apache
handling of Java properties, Command-Line and Configuration Options
HDFS paths for register, Registering UDFs
illustrate, illustrate
invoker methods, Calling Static Java Functions
Java eval funcs, Writing Evaluation and Filter Functions
joins, Joining skewed data, Joining sorted data
load and store functions, Writing Load and Store Functions
local mode execution, Running Pig Locally on Your Machine
logical optimizer and plan, Debugging Tips, Project Early and Often
macros, Macros
map declared values, Map
map schemas, Input and Output Schemas
mapreduce command, mapreduce
non-Java UDFs, User Defined Functions
number of output records in a bag, cross
parallel level, Parallel
PigUnit, Testing Your Scripts with PigUnit
preprocessor actions, Pig Latin Preprocessor, Including Other Pig Latin Scripts
Python, Embedding Pig Latin in Python, Writing Evaluation and Filter Functions, Python UDFs
runtime adaption code, Schemas
setting the Partitioner, Setting the Partitioner
summary statistics, Pig Statistics
truncation and null padding, Schemas
UDFContext class, UDFContext
UDFs languages, User Defined Functions

Vim syntax highlighting, Syntax Highlighting and Checking

W

warn method, Error Handling and Progress Reporting

web crawl, Embedding Pig Latin in Python–Utility Methods, Embedding Pig Latin in Python–Utility Methods

calculating page rank from, Embedding Pig Latin in Python–Utility Methods
data set, Embedding Pig Latin in Python–Utility Methods

White, Tom, Tune Pig and Hadoop for Your Job, Overview of Hadoop

word count example, MapReduce’s hello world

writing MapReduce in Java, compared to Pig Latin, How Pig differs from MapReduce

writing records, Writing records–Writing records

Y

Yahoo!, Pig’s History

