Before you can run Pig on your machine or your Hadoop cluster, you will need to download and install it. If someone else has taken care of this, you can skip ahead to Running Pig.
This is the official version of Apache Pig. It comes packaged with all of the JAR files needed to run Pig. It can be downloaded by going to Pig’s release page.
Pig does not need to be installed on your Hadoop cluster. It runs on the machine from which you launch Hadoop jobs. Though you can run Pig from your laptop or desktop, in practice, most cluster owners set up one or more machines that have access to their Hadoop cluster but are not part of the cluster (that is, they are not data nodes or task nodes). This makes it easier for administrators to update Pig and associated tools, as well as to secure access to the clusters. These machines are called gateway machines or edge machines. In this book I use the term gateway machine.
You will need to install Pig on these gateway machines. If your Hadoop cluster is accessible from your desktop or laptop, you can install Pig there as well. Also, you can install Pig on your local machine if you plan to use Pig in local mode.
The core of Pig is written in Java and is thus portable across operating systems. The shell script that starts Pig is a bash script, so it requires a Unix environment. Hadoop, which Pig depends on, even in local mode, also requires a Unix environment for its filesystem operations. In practice, most Hadoop clusters run a flavor of Linux. Many Pig developers develop and test Pig on Mac OS X.
Pig requires Java 1.6, and Pig versions 0.5 through 0.9 require Hadoop 0.20. For future versions, check the download page for information on what version(s) of Hadoop they require. The correct version of Hadoop is included with the Pig download. If you plan to use Pig in local mode or install it on a gateway machine where Hadoop is not currently installed, there is no need to download Hadoop separately.
Once you have downloaded Pig, you can place it anywhere you like on your machine, as it does not depend on being in a certain location. To install it, place the tarball in the directory of your choosing and unpack it with tar xzf filename, where filename is the tarball you downloaded.
The only other setup in preparation for running Pig is making sure that the environment variable JAVA_HOME is set to the directory that contains your Java distribution. Pig will fail immediately if this value is not in the environment. You can set this in your shell, specify it on the command line when you invoke Pig, or set it explicitly in your copy of the Pig script pig, located in the bin directory that you just unpacked. You can find the appropriate value for JAVA_HOME by executing which java and stripping the bin/java from the end of the result.
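The stripping step can be sketched in shell. The function name java_home_from is hypothetical, and the sketch assumes it is handed the full path of the java executable (for example, as reported by which java). If that path is a symbolic link, the result may not be the real distribution directory, so treat the output as a candidate to check rather than a definitive answer.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from the path of the
# java executable by removing the trailing /bin/java.
java_home_from() {
    # ${1%/bin/java} drops the suffix only if the path really ends with it.
    printf '%s\n' "${1%/bin/java}"
}

# Typical use (not run here):
#   export JAVA_HOME="$(java_home_from "$(which java)")"
```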
In addition to the official Apache version, there are companies that repackage and distribute Hadoop and associated tools. Currently the most popular of these is Cloudera, which produces RPMs for Red Hat–based systems and packages for use with APT on Debian systems. It also provides tarballs for other systems that cannot use one of these package managers.
The upside of using a distribution like Cloudera’s is that all of the tools are packaged and tested together. Also, if you need professional support, it is available. The downside is that you are constrained to move at the speed of your distribution provider. There is a delay between an Apache release of Pig and its availability in various distributions.
For complete instructions on downloading and installing Hadoop and Pig from Cloudera, see Cloudera’s download site. Note that you have to download Pig separately; it is not part of the Hadoop package.
In addition to the official release available from Pig’s Apache site, it is possible to download Pig from Apache’s Maven repository. This site includes JAR files for Pig, for the source code, and for the Javadocs, as well as the POM file that defines Pig’s dependencies. Development tools that are Maven-aware can use this to pull down Pig’s source and Javadoc. If you use ant in your build process, you can also pull the Pig JAR from this repository automatically.
When you download Pig from Apache, you also get the Pig source code. This enables you to debug your version of Pig or just peruse the code to see how it works. But if you want to live on the edge and try out a feature or a bug fix before it is available in a release, you can download the source from Apache’s Subversion repository. You can also apply patches that have been uploaded to Pig’s issue-tracking system but that are not yet checked into the code repository. Information on checking out Pig using svn or cloning the repository via git is available on Pig’s version control page.
Running Pig locally on your machine is referred to in Pig parlance as local mode. Local mode is useful for prototyping and debugging your Pig Latin scripts. Some people also use it for small data when they want to apply the same processing to large data—so that their data pipeline is consistent across data of different sizes—but they do not want to waste cluster resources on small files and small jobs.
In versions 0.6 and earlier, Pig executed scripts in local mode itself. Starting with version 0.7, it uses the Hadoop class LocalJobRunner, which reads from the local filesystem and executes MapReduce jobs locally. This has the nice property that Pig jobs run locally in the same way as they will on your cluster, and they all run in one process, making debugging much easier. The downside is that it is slow. Setting up a local instance of Hadoop has approximately a 20-second overhead, so even tiny jobs take at least that long.
Let’s run a Pig Latin script in local mode. See Code Examples in This Book for how to download the data and Pig Latin for this example. The simple script in Example 2-1 loads the file NYSE_dividends, groups the file’s rows by stock ticker symbol, and then calculates the average dividend for each symbol.
Example 2-1. Running Pig in local mode
--average_dividend.pig
-- load data from NYSE_dividends, declaring the schema to have 4 fields
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
-- group rows together by stock ticker symbol
grouped = group dividends by symbol;
-- calculate the average dividend per symbol
avg = foreach grouped generate group, AVG(dividends.dividend);
-- store the results to average_dividend
store avg into 'average_dividend';
NYSE CPO 2009-12-30 0.14
NYSE CPO 2009-09-28 0.14
NYSE CPO 2009-06-26 0.14
NYSE CPO 2009-03-27 0.14
NYSE CPO 2009-01-06 0.14
This matches the schema we declared in our Pig Latin script. The first field is the exchange this stock is traded on, the second field is the stock ticker symbol, the third is the date the dividend was paid, and the fourth is the amount of the dividend.
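To make the computation concrete, here is a rough shell equivalent of what the script in Example 2-1 computes, using awk instead of Pig. It is only an illustration of the group-and-average logic, not how Pig executes the script. It assumes the fields in NYSE_dividends are tab-separated (the default delimiter for Pig's load), and the function name average_dividend is hypothetical.

```shell
# Hypothetical stand-in for average_dividend.pig: read tab-separated rows
# (exchange, symbol, date, dividend) on stdin and print "symbol<TAB>average".
average_dividend() {
    awk -F '\t' '
        # Accumulate a running sum and row count per symbol (field 2).
        { sum[$2] += $4; count[$2]++ }
        END {
            # Emit one output row per symbol.
            for (s in sum)
                printf "%s\t%g\n", s, sum[s] / count[s]
        }' | sort   # awk iterates arrays in unspecified order, so sort the rows
}

# Typical use (not run here):
#   average_dividend < NYSE_dividends | head -5
```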
Remember that to run Pig you will need to set the JAVA_HOME environment variable to the directory that contains your Java distribution.
pig_path/bin/pig -x local average_dividend.pig
The result should be a lot of output on your screen. Much of this is MapReduce’s LocalJobRunner generating logs. But some of it is Pig telling you how it will execute the script, giving you the status as it executes, etc. Near the bottom of the output you should see the message Success!. This means all went well. The script stores its output to average_dividend, so you might expect to find a file by that name in your local directory. Instead you will find a directory named average_dividend that contains a file named part-r-00000. Because Hadoop is a distributed system and usually processes data in parallel, when it outputs data to a “file” it creates a directory with the file’s name, and each writer creates a separate part file in that directory. In this case we had one writer, so we have one part file. We can look in that part file for the results by entering:
cat average_dividend/part-r-00000 | head -5
CA 0.04
CB 0.35
CE 0.04
CF 0.1
CI 0.04
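When a job has more than one writer, the output directory holds several part files. A minimal sketch for reading them all at once follows. It assumes the part files follow the part-* naming seen above, and the function name cat_output_dir is hypothetical.

```shell
# Hypothetical helper: print the contents of every part file in a
# Hadoop-style output directory, in name order.
cat_output_dir() {
    for part in "$1"/part-*; do
        # If the directory has no part files the pattern stays unexpanded,
        # so only cat entries that are real files.
        if [ -f "$part" ]; then
            cat "$part"
        fi
    done
}

# Typical use (not run here):
#   cat_output_dir average_dividend | head -5
```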
Most of the time you will be running Pig on your Hadoop cluster. As was covered in Downloading and Installing Pig, Pig runs locally on your machine or your gateway machine. All of the parsing, checking, and planning is done locally. Pig then executes MapReduce jobs in your cluster.
When I say “your gateway machine,” I mean the machine from which you are launching Pig jobs. Usually this will be one or more machines that have access to your Hadoop cluster. However, depending on your configuration, it could be your local machine as well.
The only thing Pig needs to know to run on your cluster is the location of your cluster’s NameNode and JobTracker. The NameNode is the manager of HDFS, and the JobTracker coordinates MapReduce jobs. In Hadoop 0.18 and earlier, these locations are found in your hadoop-site.xml file. In Hadoop 0.20 and later, they are in three separate files: core-site.xml, hdfs-site.xml, and mapred-site.xml.
If you are already running Hadoop jobs from your gateway machine via MapReduce or another tool, you most likely have these files present. If not, the best course is to copy these files from nodes in your cluster to a location on your gateway machine. This guarantees that you get the proper addresses plus any site-specific settings.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode_hostname:port</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker_hostname:port</value>
  </property>
</configuration>
Once you have located, copied, or created these files, you will need to tell Pig the directory they are in by setting the PIG_CLASSPATH environment variable to that directory. Note that this must point to the directory that the XML file is in, not the file itself. Pig will read all XML and properties files in that directory.
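Because PIG_CLASSPATH must name a directory rather than a file, a small check before launching Pig can catch the most common mistake. The following is a sketch only. The function name is_hadoop_conf_dir is hypothetical, and it looks solely for the configuration filenames mentioned above.

```shell
# Hypothetical pre-flight check: succeed only if the argument is a directory
# that holds hadoop-site.xml (Hadoop 0.18 and earlier) or core-site.xml
# (Hadoop 0.20 and later).
is_hadoop_conf_dir() {
    # A path to the XML file itself is rejected here.
    [ -d "$1" ] || return 1
    [ -f "$1/hadoop-site.xml" ] || [ -f "$1/core-site.xml" ]
}

# Typical use (not run here):
#   is_hadoop_conf_dir "$PIG_CLASSPATH" || echo "PIG_CLASSPATH is not a config directory"
```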
Let’s run the same script on your cluster that we ran in the local mode example (Example 2-1). If you are running on a Hadoop cluster you have never used before, you will most likely need to create a home directory. Pig can do this for you:
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -mkdir /user/username
where hadoop_conf_dir is the directory where your hadoop-site.xml or core-site.xml, hdfs-site.xml, and mapred-site.xml files are located; pig_path is the path to Pig on your gateway machine; and username is your username on the gateway machine. If you are using 0.5 or earlier, change fs -mkdir to mkdir.
Remember, you need to set JAVA_HOME before executing any Pig commands. See Downloading the Pig Package from Apache for details.
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -copyFromLocal NYSE_dividends NYSE_dividends
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig average_dividend.pig
The first few lines of output will tell you how Pig is connecting to your cluster. After that it will describe its progress in executing your script. It is important for you to verify that Pig is connecting to the appropriate filesystem and JobTracker by checking that these values match the values for your cluster. If the filesystem is listed as file:/// or the JobTracker says localhost, Pig did not connect to your cluster. You will need to check that you entered the values properly in your configuration files and properly set PIG_CLASSPATH to the directory that contains those files.
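This check can be automated in a rough way. The sketch below assumes you have captured Pig's startup output in a file or pipe (which may require redirecting standard error). It makes no assumption about the exact wording of Pig's messages and only searches for the file:/// marker described above. The function name connected_to_cluster is hypothetical.

```shell
# Hypothetical check: read Pig's startup output on stdin and fail if it
# mentions the local filesystem (file:///), which suggests Pig did not
# pick up the cluster configuration.
connected_to_cluster() {
    ! grep -q 'file:///'
}

# Typical use (not run here), with pig_output.log being a captured log:
#   connected_to_cluster < pig_output.log || echo "Pig is using the local filesystem"
```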
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e cat average_dividend
In Example 2-1 you may have noticed that I made a point to say that average_dividend is a directory, and thus you had to cat the part file contained in that directory. However, in this example I ran cat directly on average_dividend. If you list average_dividend, you will see that it is still a directory in this example, but in Pig, cat can operate on directories. See Chapter 3 for a discussion of this.
Cloud computing, along with the software as a service (SaaS) model, has taken off in recent years. This has been fortuitous for hardware-intensive applications such as Hadoop. Setting up and maintaining a Hadoop cluster is an expensive proposition in terms of hardware acquisition, facility costs, and maintenance and administration. Many users find that it is cheaper to rent the hardware they need instead.
Whether you or your organization decides to use Hadoop and Pig in the cloud or on owned and operated machines, the instructions for running Pig on your cluster are the same; see Running Pig on Your Hadoop Cluster.
However, Amazon’s Elastic MapReduce (EMR) cloud offering is different. Rather than allowing customers to rent machines for any type of process (like Amazon’s Elastic Cloud Computing [EC2] service and other cloud services), EMR allows users to rent virtual Hadoop clusters. These clusters read data from and write data to Amazon’s Simple Storage Service (S3). This means users do not even need to set up their own Hadoop cluster, which they would have to do if they used EC2 or a similar service.
EMR users can access their rented Hadoop cluster via their browser, SSH, or a web services API. For information about EMR, visit http://aws.amazon.com/elasticmapreduce. However, I suggest beginning with this nice tutorial, which will introduce you to the service.
Pig has a number of command-line options that you can use with it. You can see the full list by entering pig -h. Most of these options will be discussed later, in the sections that cover the features these options control. In this section I discuss the remaining options.
Hadoop also has a number of Java properties it uses to determine its behavior. For example, you can pass options to the JVM that runs your map and reduce tasks by setting mapred.child.java.opts. In Pig version 0.8 and later, these can be passed to Pig, and then Pig will pass them on to Hadoop when it invokes Hadoop. In earlier versions, the properties had to be in hadoop-site.xml so that the Hadoop client itself would pick them up.
Properties can be passed to Pig on the command line using -D in the same format as any Java property; for example, bin/pig -D exectype=local. When placed on the command line, these property definitions must come before any Pig-specific command-line options (such as -x local). They can also be specified in the conf/pig.properties file that is part of your Pig distribution. Finally, you can specify a separate properties file by using -P. If properties are specified on both the command line and in a properties file, the command-line specification takes precedence.
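The ordering rule for -D properties can be illustrated with a small wrapper that assembles the command line. This is a sketch under stated assumptions: the function name pig_command and its convention of separating properties from other options with -- are hypothetical, and it only prints the command for inspection rather than running Pig. It does not handle values that contain spaces.

```shell
# Hypothetical wrapper: print a pig command line in which every name=value
# argument given before "--" becomes a -D property placed ahead of the
# Pig-specific options given after "--".
pig_command() {
    cmd="bin/pig"
    # Properties first: each one is emitted as "-D name=value".
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        cmd="$cmd -D $1"
        shift
    done
    # Drop the "--" separator if it was present.
    if [ "$#" -gt 0 ]; then
        shift
    fi
    # Everything else (Pig options, script name) follows the properties.
    for opt in "$@"; do
        cmd="$cmd $opt"
    done
    printf '%s\n' "$cmd"
}

# Example (the -Xmx512m value is illustrative only):
#   pig_command mapred.child.java.opts=-Xmx512m -- -x local average_dividend.pig
```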
Pig uses return codes, described in Table 2-1, to communicate success or failure.
Table 2-1. Pig return codes
|Value||Meaning||Comments|
|3||Partial failure||Used with multiquery; see Nonlinear Data Flows|
|4||Illegal arguments passed to Pig|| |
|5||IOException thrown||Would usually be thrown by a UDF|
|6||PigException thrown||Usually means a Python UDF raised an exception|
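A script that launches Pig can branch on these codes. The following is a minimal sketch: the function name describe_pig_status is hypothetical, the descriptions for codes 3 through 6 are taken from Table 2-1, and treating 0 as success is an assumption based on the usual Unix convention. Any other code is reported generically.

```shell
# Hypothetical helper: translate a Pig exit status into a short message.
# Codes 3-6 follow Table 2-1; 0 as success is an assumed Unix convention.
describe_pig_status() {
    case "$1" in
        0) echo "Success" ;;
        3) echo "Partial failure" ;;
        4) echo "Illegal arguments passed to Pig" ;;
        5) echo "Exception, usually thrown by a UDF" ;;
        6) echo "Exception, usually raised by a Python UDF" ;;
        *) echo "Failure (code $1)" ;;
    esac
}

# Typical use (not run here):
#   pig_path/bin/pig average_dividend.pig
#   describe_pig_status "$?"
```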
 Another reason for switching to MapReduce for local mode was that as Pig added features that took advantage of more advanced MapReduce features, it became difficult or impossible to replicate those features in local mode. Thus local mode and MapReduce mode were diverging in their feature set.
 Being the current flavor of the month, the term cloud computing is being used to describe just about anything that takes more than one computer and is not located on a person’s desktop. In this chapter I use cloud computing to mean the ability to rent a cluster of computers and place software of your choosing on those computers.