Chapter 6. How MapReduce Works
In this chapter, we look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call:
submit() on a
Job object (note that you
can also call
which submits the job if it hasn’t been submitted already, then waits for
it to finish). This method call conceals a great deal of processing behind
the scenes. This section uncovers the steps Hadoop takes to run a
We saw in Chapter 5 that the way Hadoop executes a MapReduce program depends on a couple of configuration settings.
In versions of Hadoop up to and
including the 0.20 release series,
determines the means of execution. If this configuration property
is set to
local (the default), the
local job runner is used. This runner runs the whole job in a single JVM.
It’s designed for testing and for running MapReduce programs on small
mapred.job.tracker is set
to a colon-separated host and port pair, then the property is interpreted
as a jobtracker address, and the runner submits the job to the jobtracker
at that address. The whole process is described in detail in the next
In Hadoop 2.0, a new MapReduce implementation was introduced. The new implementation (called MapReduce 2) is built on a system called YARN, described in YARN ...