O'Reilly logo

Programming Pig by Alan Gates

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 9. Embedding Pig Latin in Python

Pig Latin is a dataflow language. Unlike general-purpose programming languages, it does not include control flow constructs such as if and for. For many data-processing applications, the operators Pig provides are sufficient. But there are classes of problems that either require the data flow to be repeated an indefinite number of times or need to branch based on the results of an operator. Iterative processing, where a calculation needs to be repeated until the margin of error is within an acceptable limit, is one example. It is not possible to know beforehand how many times the data flow will need to be run before processing begins.

Blending data flow and control flow in one language is difficult to do in a way that is useful and intuitive. Building a general-purpose language and all the associated tools, such as IDEs and debuggers, is a considerable undertaking; also, there is no lack of such languages already. If we turned Pig Latin into a general-purpose language, it would require users to learn a much bigger language to process their data. For these reasons, we decided to embed Pig in existing scripting languages. This avoids the need to invent a new language while still providing users with the features they need to process their data.[21]

As with UDFs, we chose to use Python for the initial release of embedded Pig in version 0.9. The embedding interface is a Java class, so a Jython interpreter is used to run these Python scripts that embed Pig. This means Python 2.5 features can be used but Python 3 features cannot. In the future we hope to extend the system to other scripting languages that can access Java objects, such as JavaScript[22] and JRuby. Of course, since the Pig infrastructure is all in Java, it is possible to use this same interface to embed Pig into Java scripts.

This embedding is done in a JDBC-like style, where your Python script first compiles a Pig Latin script, then binds variables from Python to it, and finally runs it. It is also possible to do filesystem operations, register JARs, and perform other utility operations through the interface. The top-level class for this interface is org.apache.pig.scripting.Pig.

Throughout this chapter we will use an example of calculating page rank from a web crawl. You can find this example under examples/ch9 in the example code. This code iterates over a set of URLs and links to produce a page rank for each URL.[23] The input to this example is the webcrawl data set found in the examples. Each record in this input contains a URL, a starting rank of 1, and a bag with a tuple for each link found at that URL:

http://pig.apache.org/privacypolicy.html 1 {(http://www.google.com/privacy.html)}
http://www.google.com/privacypolicy.html 1 {(http://www.google.com/faq.html)}
http://desktop.google.com/copyrights.html 1 {}

Even though control flow is done via a Python script, it can still be run using Pig’s bin/pig script. bin/pig looks for the #! line and calls the appropriate interpreter. This allows you to use these scripts with systems that expect to invoke a Pig Latin script. It also allows Pig to include UDFs from this file automatically and to give correct line numbers for error messages.

In order to use the Pig class and related objects, the code must first import them into the Python script:

from org.apache.pig.scripting import *


Calling the static method Pig.compile causes Pig to do an initial compilation of the code. Because we have not bound the variables yet, this check cannot completely verify the script. Type checking and other semantic checking is not done at this phase—only the syntax is checked. compile returns a Pig object that can be bound to a set of variables:

# pagerank.py
P = Pig.compile("""
previous_pagerank = load '$docs_in' as (url:chararray, pagerank:float,
outbound_pagerank = foreach previous_pagerank generate
                      pagerank / COUNT(links) as pagerank,
                      flatten(links) as to_url;
cogrpd            = cogroup outbound_pagerank by to_url,
                      previous_pagerank by url;
new_pagerank      = foreach cogrpd generate group as url,
                      (1 - $d) + $d * SUM (outbound_pagerank.pagerank)
                      as pagerank,
                      flatten(previous_pagerank.links) as links,
                      flatten(previous_pagerank.pagerank) AS previous_pagerank;
store new_pagerank into '$docs_out';
nonulls           = filter new_pagerank by previous_pagerank is not null and
                        pagerank is not null;
pagerank_diff     = foreach nonulls generate ABS (previous_pagerank - pagerank);
grpall            = group pagerank_diff all;
max_diff          = foreach grpall generate MAX (pagerank_diff);
store max_diff into '$max_diff';

The only pieces of this Pig Latin script that we have not seen before are the four parameters, marked in the script as $d, $docs_in, $docs_out, and $max_diff. The syntax for these parameters is the same as for parameter substitution. However, Pig expects these to be supplied by the control flow script when bind is called.

There are three other compilation methods in addition to the one shown in this example. compile(String name, String script) takes a name in addition to the Pig Latin to be compiled. This name can be used in other Pig Latin code blocks to import this block:

P1 = Pig.compile("initial", """
A = load 'input';
    P2 = Pig.compile("""
import initial;
B = load 'more_input';

There are two compilation methods called compileFromFile. These take the same arguments as compile, but they expect the script argument to refer to a file containing the script, rather than the script itself.

[21] In some of the documentation, wiki pages, and issues on JIRA, embedded Pig is referred to as Turing Complete Pig. This was what the project was called when it first started, even though we did not make Pig itself Turing complete.

[22] There is already an experimental version of JavaScript in 0.9.

[23] The example code was graciously provided by Julien Le Dem.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required