Chapter 11. Pig on Tez

Pig 0.14 added support for a new execution engine, Apache Tez. You can think of Tez as a faster, more flexible alternative to MapReduce. Whenever possible, you should run Pig on Tez instead of MapReduce.

What Is Tez?

As we have learned in previous chapters, Pig traditionally uses MapReduce as its execution engine. That is, a Pig Latin script is translated into a series of MapReduce jobs for execution. For a long time, MapReduce was the only option for running a workload on a Hadoop cluster. However, with the introduction of YARN in Hadoop 2.0, this situation changed. With YARN, you can run different types of workloads on a single cluster; MapReduce is merely one of them.

MapReduce is not an optimal engine for Pig. MapReduce is very rigid; it requires Pig to decompose its workload into one or more MapReduce jobs. This prevents Pig from mixing map and reduce tasks in other ways or using other types of tasks. Also, MapReduce requires that data be stored in HDFS between jobs, preventing optimizations in data movement. Finally, MapReduce’s job scheduler thinks about each job in isolation; it does not understand that the series of jobs Pig is submitting are related. This limits its ability to schedule efficiently.
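The cost of the HDFS round trip between jobs can be pictured with a toy model. The sketch below is purely illustrative and is not Pig or MapReduce code: it assumes a hypothetical script that Pig has compiled into a linear chain of three jobs, and it counts how many intermediate results must be stored in HDFS before the next job can start. The function name and the job names are made up for this example.

```python
def intermediate_hdfs_writes(jobs):
    """Count intermediate results stored in HDFS for a linear chain of jobs.

    Toy model (an assumption, not Pig internals): every job except the last
    hands its output to the next job through HDFS. The last job's output is
    the final result, so it is not counted as intermediate.
    """
    return max(len(jobs) - 1, 0)


# Hypothetical chain of MapReduce jobs produced from one Pig Latin script.
chain = ["join", "group", "order"]

# Show each handoff that goes through HDFS.
for producer, consumer in zip(chain, chain[1:]):
    print(f"{producer} -> HDFS -> {consumer}")

print("intermediate HDFS writes:", intermediate_hdfs_writes(chain))
```

In this model the count grows with the length of the chain, which is why a script that needs many jobs pays the storage cost repeatedly.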

One of the important features of MapReduce is that it has a simple API—it is easy for a programmer to understand and write MapReduce code. However, for Pig, MapReduce is just an internal engine and the MapReduce API is not exposed to the user. The simplicity of the MapReduce API does ...
