Appendix B. Overview of Hadoop
This appendix gives a brief overview of Hadoop, focusing on elements that are of interest to Pig users. For a thorough discussion of Hadoop, see Hadoop: The Definitive Guide, by Tom White (O’Reilly). Hadoop’s two main components are MapReduce and HDFS.
MapReduce
MapReduce is the framework for running jobs in Hadoop. It provides a simple and powerful paradigm for parallelizing data processing.
The JobTracker is the central coordinator of jobs in MapReduce. It controls which jobs are being run, which resources they are assigned, etc. On each node in the cluster there is a TaskTracker that is responsible for running the map or reduce tasks assigned to it by the JobTracker.
MapReduce views its input as a collection of records. When reading from HDFS, a record is usually a single line of text. Each record has a key and a value. There is no requirement that data be sorted by key or that the keys must be unique. Similarly, MapReduce produces a set of records, each with a key and value.
MapReduce operates on data in jobs. Every job has one input and one output.[32] MapReduce breaks each job into a series of tasks. These tasks are of two primary types: map and reduce.
Map Phase
In the map phase, MapReduce gives the user an opportunity to operate on every record in the data set individually. This phase is commonly used to project out unwanted fields, transform fields, or apply filters. Certain types of joins and grouping can also be done in the map (e.g., joins where ...