Chapter 11. Writing Load and Store Functions
We will now consider some of the more complex and most critical parts of Pig: data input and output. Operating on huge data sets is inherently I/O-intensive. Hadoop's massive parallelism and its practice of moving processing to the data mitigate this cost but do not remove it. Having efficient methods to load and store data is therefore critical. Pig provides default load and store functions for text data and for HBase, but many users find they need to write their own load and store functions to handle the data formats and storage mechanisms they use.
As with evaluation functions, the design goal for load and store functions was to make easy things easy and hard things possible. We also wanted load and store functions to be thin wrappers over Hadoop's InputFormat and OutputFormat. The intention is that once you have an input format and an output format for your data, the additional work of creating and storing Pig tuples is minimal. And, as was done for evaluation functions, more complex features such as schema management and projection pushdown are handled via separate interfaces to avoid cluttering the base interfaces. Pig's load and store functions were completely rewritten between versions 0.6 and 0.7. This chapter covers only the interfaces for 0.7 and later releases.
One other important design goal for load and store functions is not to assume that the input sources and output sinks are HDFS. In the examples throughout this book, the load and store locations have been files in HDFS, but a load or store function is free to read from or write to any system for which an input or output format exists, such as HBase.