Chapter 10. Writing Load and Store Functions

We will now consider some of the more complex and most critical parts of Pig: data input and output. Operating on huge datasets is inherently I/O-intensive. Hadoop’s massive parallelism and its movement of processing to the data mitigate this cost but do not remove it. Having efficient methods to load and store data is therefore critical. Pig provides default load and store functions for text data and for HBase, but many users find they need to write their own load and store functions to handle the data formats and storage mechanisms they use.

As with evaluation functions, the design goal for load and store functions in Pig was to make easy things easy and hard things possible. Another aim was to make load and store functions a thin wrapper over Hadoop’s InputFormat and OutputFormat. The intention is that once you have an input format and output format for your data, the additional work of creating and storing Pig tuples is minimal. Also as with evaluation functions, more complex features such as schema management and projection pushdown are provided through separate interfaces to avoid cluttering the base interface.
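To make the "thin wrapper" idea concrete, here is a minimal sketch that uses only the JDK rather than the real Pig and Hadoop classes. `RecordSource` is a hypothetical stand-in for the record reader an input format would supply, and `DelimitedLoader` is a hypothetical loader whose only job is to turn each raw record into a tuple (modeled here as a `List<Object>`). None of these names come from Pig; they are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

public class ThinLoaderSketch {

    /** Hypothetical stand-in for an input format's record reader. */
    interface RecordSource {
        /** Returns the next raw record, or null when the input is exhausted. */
        String nextRecord();
    }

    /** Hypothetical loader: reading is delegated, only tuple creation lives here. */
    static class DelimitedLoader {
        private final RecordSource source;
        private final String delimiter;

        DelimitedLoader(RecordSource source, String delimiter) {
            this.source = source;
            this.delimiter = delimiter;
        }

        /** Returns the next tuple, or null at end of input. */
        List<Object> getNext() {
            String record = source.nextRecord();
            if (record == null) {
                return null;
            }
            // Split on the literal delimiter; the -1 limit keeps trailing empty fields.
            String[] fields = record.split(Pattern.quote(delimiter), -1);
            return new ArrayList<>(Arrays.asList(fields));
        }
    }

    /** In-memory record source, used here in place of real storage. */
    static RecordSource fromLines(List<String> lines) {
        Iterator<String> it = lines.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        RecordSource source = fromLines(Arrays.asList("alice\t30", "bob\t25"));
        DelimitedLoader loader = new DelimitedLoader(source, "\t");
        List<Object> tuple;
        while ((tuple = loader.getNext()) != null) {
            System.out.println(tuple);
        }
    }
}
```

The point of the split is that all knowledge of where the bytes live and how they are divided into records stays in the record source, so the loader needs only a few lines of conversion code.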

One other important design goal for load and store functions was to not assume that the input sources and output sinks are HDFS. In the examples throughout this book, A = load 'foo'; has implied that foo is a file, but there is no need for that to be the case. foo is a resource locator that makes sense to your load function. ...
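As a rough illustration of a location that is not necessarily a file, the sketch below shows one way a load function might interpret the string it is handed. The `describe` method and the choice of an `hbase` scheme are assumptions made for this example, not part of Pig's API; only `java.net.URI` from the JDK is used.

```java
import java.net.URI;

public class LocationSketch {

    /**
     * Hypothetical helper: reports how a loader might interpret a location string.
     * A bare name is treated as a path, an "hbase" URI as a table name, and
     * anything else is passed through with its scheme.
     */
    static String describe(String location) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        if (scheme == null) {
            // No scheme, as in: A = load 'foo';
            return "path:" + uri.getPath();
        }
        if (scheme.equals("hbase")) {
            // Illustrative convention: the authority part names a table.
            return "table:" + uri.getAuthority();
        }
        return scheme + ":" + uri.getSchemeSpecificPart();
    }

    public static void main(String[] args) {
        System.out.println(describe("foo"));
        System.out.println(describe("hbase://users"));
    }
}
```

Pig itself does not need to understand the string in either case; it only has to pass it through to the load function that does.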
