Preface

Data is addictive. Our ability to collect and store it has grown massively in the last several decades, yet our appetite for ever more data shows no sign of being satiated. Scientists want to be able to store more data in order to build better mathematical models of the world. Marketers want better data to understand their customers’ desires and buying habits. Financial analysts want to better understand the workings of their markets. And everybody wants to keep all their digital photographs, movies, emails, etc.

Before the computer and Internet revolutions, the US Library of Congress was one of the largest collections of data in the world. It is estimated that its printed collections contain approximately 10 terabytes (TB) of information. Today, large Internet companies collect that much data on a daily basis. And it is not just Internet applications that are producing data at prodigious rates. For example, the Large Synoptic Survey Telescope (LSST) under construction in Chile is expected to produce 15 TB of data every day.

Part of the reason for the massive growth in available data is our ability to collect much more data. Every time someone clicks a website’s links, the web server can record information about what page the user was on and which link he clicked. Every time a car drives over a sensor in the highway, its speed can be recorded. But much of the reason is also our ability to store that data. Ten years ago, telescopes took pictures of the sky every night. But they could not store the collected data at the same level of detail that will be possible when the LSST is operational. The extra data was being thrown away because there was nowhere to put it. The ability to collect and store vast quantities of data only feeds our data addiction.

One of the most commonly used tools for storing and processing data in computer systems over the last few decades has been the relational database management system (RDBMS). But as datasets have grown large, only the more sophisticated (and hence more expensive) RDBMSs have been able to reach the scale many users now desire. At the same time, many engineers and scientists involved in processing the data have realized that they do not need everything offered by an RDBMS. These systems are powerful and have many features, but many data owners who need to process terabytes or petabytes of data need only a subset of those features.

The high cost and unneeded features of RDBMSs have led to the development of many alternative data-processing systems. One such alternative system is Apache Hadoop. Hadoop is an open source project started by Doug Cutting. Over the past several years, Yahoo! and a number of other web companies have driven the development of Hadoop, which was based on papers published by Google describing how its engineers were dealing with the challenge of storing and processing the massive amounts of data they were collecting. Hadoop is installed on a cluster of machines and provides a means to tie together storage and processing in that cluster. For a history of the project, see Hadoop: The Definitive Guide, by Tom White (O’Reilly).

The development of new data-processing systems such as Hadoop has spurred the porting of existing tools and languages and the construction of new tools, such as Apache Pig. Tools like Pig provide a higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop without requiring them to write extensive data-processing applications in low-level Java code.

Who Should Read This Book

This book is intended for Pig programmers, new and old. Those who have never used Pig will find introductory material on how to run Pig and how to get started writing Pig Latin scripts. For seasoned Pig users, this book covers almost every feature of Pig: the different modes it can be run in, complete coverage of the Pig Latin language, and how to extend Pig with your own user-defined functions (UDFs). Even those who have been using Pig for a long time are likely to discover features they have not used before.

Some knowledge of Hadoop will be useful for readers and Pig users. If you’re not already familiar with it or want a quick refresher, “Pig on Hadoop” walks through a very simple example of a Hadoop job.

Small snippets of Java, Python, and SQL are used in parts of this book. Knowledge of these languages is not required to use Pig, but knowledge of Python and Java will be necessary for some of the more advanced features. Those with a SQL background may find “Comparing Query and Data Flow Languages” to be a helpful starting point in understanding the similarities and differences between Pig Latin and SQL.

What’s New in This Edition

The second edition covers Pig 0.10 through Pig 0.16, which is the latest version at the time of writing. For features introduced before 0.10, we will not call out the initial version of the feature. For newer features introduced after 0.10, we will point out the version in which the feature was introduced.

Pig runs on both Hadoop 1 and Hadoop 2 for all the versions covered in the book. To simplify our discussion, we assume Hadoop 2 is the target platform and will point out the difference for Hadoop 1 whenever applicable in this edition.

The second edition has two new chapters: “Pig on Tez” (Chapter 11) and “Use Cases and Programming Examples” (Chapter 13). Other chapters have also been updated with the latest additions to Pig and information on existing features not covered in the first edition. These include but are not limited to:

  • New data types (boolean, datetime, biginteger, bigdecimal) are introduced in Chapter 3.

  • New UDFs are covered in various places, including support for leveraging Hive UDFs (Chapter 4) and applying Bloom filters (Chapter 7).

  • New Pig operators and constructs such as rank, cube, assert, nested foreach and nested cross, and casting relations to scalars are presented in Chapter 5.

  • New performance optimizations—map-side aggregation, schema tuples, the shared JAR cache, auto local and direct fetch modes, etc.—are covered in Chapter 7.

  • Scripting UDFs in JavaScript, JRuby, Groovy, and streaming Python are discussed in Chapter 9, and embedding Pig in scripting languages is covered in Chapter 8 and Chapter 13 (“k-Means”). We also describe the Pig progress notification listener in Chapter 8.

  • We look at the new EvalFunc interface in Chapter 9, including the topics of compile-time evaluation, shipping dependent JARs automatically, and variable-length inputs. The new LoadFunc/StoreFunc interface is described in Chapter 10: we discuss topics such as predicate pushdown, auto-shipping JARs, and handling bad records.

  • New developments in community projects such as WebHCat, Spark, Accumulo, DataFu, and Oozie are described in Chapter 12.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context. Also used to show the output of describe statements in scripts.

Note

This icon signifies a tip, suggestion, or general note.

Caution

This icon indicates a warning or caution.

Code Examples in This Book

Many of the example scripts, UDFs, and datasets used in this book are available for download from Alan’s GitHub repository. README files are included to help you build the UDFs and understand the contents of the datafiles. Each example script in the text that is available on GitHub has a comment at the beginning that gives the filename. Pig Latin and Python script examples are organized by chapter in the examples directory. UDFs, both Java and Python, are in a separate directory, udfs. All datasets are in the data directory.

For brevity, each script is written assuming that the input and output are in the local directory. Therefore, when in local mode, you should run Pig in the directory that contains the input data. When running on a cluster, you should place the data in your home directory on the cluster.

Example scripts were tested against Pig 0.15.0 and should work against Pig 0.10.0 through 0.15.0 unless otherwise indicated.

The three datasets used in the examples are real datasets, though quite small ones. The file baseball contains baseball player statistics. The second set contains New York Stock Exchange data in two files: NYSE_daily and NYSE_dividends. This data was trimmed to include only stock symbols starting with C from the year 2009, to make it small enough to download easily. However, the schema of the data has not changed. If you want to download the entire dataset and place it on a cluster (only a few nodes would be necessary), it would be a more realistic demonstration of Pig and Hadoop. Instructions on how to download the data are in the README files. The third dataset is a very brief web crawl started from Pig’s home page.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, authors, publisher, and ISBN. For example: “Programming Pig by Alan Gates and Daniel Dai (O’Reilly). Copyright 2017 Alan Gates and Daniel Dai, 978-1-491-93709-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

Safari® Books Online

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

To comment or ask technical questions about this book, send email to:

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

Acknowledgments from the First Edition (Alan Gates)

A book is like a professional football team. Much of the glory goes to the quarterback or a running back. But if the team has a bad offensive line, the quarterback never gets the chance to throw the ball. Receivers must be able to catch, and the defense must be able to prevent the other team from scoring. In short, the whole team must play well in order to win. And behind those on the field there is an array of coaches, trainers, and managers who prepare and guide the team. So it is with this book. My name goes on the cover. But without the amazing group of developers, researchers, testers, documentation writers, and users that contribute to the Pig project, there would be nothing worth writing about.

In particular, I would like to acknowledge Pig contributors and users for their contributions and feedback on this book. Chris Olston, Ben Reed, Richard Ding, Olga Natkovitch, Thejas Nair, Daniel Dai, and Dmitriy Ryaboy all provided helpful feedback on draft after draft. Julien Le Dem provided the example code for embedding Pig in Python. Jeremy Hanna wrote the section for Pig and Cassandra. Corrine Chandel deserves special mention for reviewing the entire book. Her feedback has added greatly to the book’s clarity and correctness.

Thanks go to Tom White for encouraging me in my aspiration to write this book, and for the sober warnings concerning the amount of time and effort it would require. Chris Douglas of the Hadoop project provided me with very helpful feedback on the sections covering Hadoop and MapReduce.

I would also like to thank Mike Loukides and the entire team at O’Reilly. They have made writing my first book an enjoyable and exhilarating experience. Finally, thanks to Yahoo! for nurturing Pig and dedicating more than 25 engineering years (and still counting) of effort to it, and for graciously giving me the time to write this book.

Second Edition Acknowledgments (Alan Gates and Daniel Dai)

In addition to the ongoing debt we owe to those acknowledged in the first edition, we would like to thank those who have helped us with the second edition. These include Rohini Palaniswamy and Sumeet Singh for their discussion of Pig at Yahoo!, and Yahoo! for allowing them to share their experiences. Zongjun Qi, Yiping Han, and Particle News also deserve our thanks for sharing their experience with Pig at Particle News. Thanks also to Ofer Mendelevitch for his suggestions on use cases.

We would like to thank Tom Hanlon, Aniket Mokashi, Koji Noguchi, Rohini Palaniswamy, and Thejas Nair, who reviewed the book and gave valuable suggestions to improve it.

We would like to thank Marie Beaugureau for prompting us to write this second edition, for all her support along the way, and for her patience with our sadly lax adherence to the schedule.

Finally, we would like to thank Hortonworks for supporting the Pig community and us while we worked on this second edition.
