Chapter 8. Embedding Pig

In addition to running Pig on the command line, you can also invoke Pig programmatically. In this chapter, we will explore two options: embedding Pig inside a scripting language and using the Pig Java APIs.

Embedding Pig Latin in Scripting Languages

As we’ve said previously, Pig Latin is a data flow language. Unlike general-purpose programming languages, it does not include control flow constructs such as if and for. For many data-processing applications, the operators Pig provides are sufficient. But there are classes of problems that either require the data flow to be repeated an indefinite number of times or need to branch based on the results of an operator. Iterative processing, where a calculation needs to be repeated until the margin of error is within an acceptable limit, is one example. It is not possible to know beforehand how many times the data flow will need to be run before processing begins.

Blending data flow and control flow in one language is difficult to do in a way that is useful and intuitive. Building a general-purpose language and all the associated tools, such as IDEs and debuggers, is a considerable undertaking; also, there is no lack of control flow languages already, and turning Pig Latin into a general-purpose language would require users to learn a much bigger language to process their data. For these reasons, the decision was made to instead embed Pig in existing scripting languages. This avoids the need to invent a new language ...

Get Programming Pig, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.