Chapter 1. What Is Pig?
Pig provides an engine for executing data flows in parallel on Apache Hadoop.
It includes a language, Pig Latin, for expressing
these data flows. Pig Latin includes operators for many of the traditional
data operations (join, sort, filter,
etc.), as well as providing the ability for users to develop their own
functions for reading, processing, and writing data.
Pig is an Apache open source project. This means users are free to download it as source or binary, use it for themselves, contribute to it, and—under the terms of the Apache License—use it in their products and change it as they see fit.
Pig Latin, a Parallel Data Flow Language
Pig Latin is a data flow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. These data flows can be simple linear flows, or complex workflows that include points where multiple inputs are joined and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
This means that Pig Latin looks different from
many of the programming languages you may have seen. There are no
if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access