Chapter 6. How MapReduce Works
In this chapter, we look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object (note that you can also call waitForCompletion(), which submits the job if it hasn’t been submitted already, then waits for it to finish).[51] This method call conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.
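As a concrete illustration, here is a minimal driver sketch that submits a job and waits for it to finish. The class names MyJobDriver, MyMapper, and MyReducer are hypothetical stand-ins for your own classes, and the snippet assumes a Hadoop cluster (or the local runner) is available, so it is a sketch rather than a definitive implementation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; MyMapper and MyReducer stand in for
// your own Mapper and Reducer implementations.
public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "my job");
    job.setJarByClass(MyJobDriver.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // submit() would return immediately; waitForCompletion() submits the
    // job if necessary and blocks until it finishes, printing progress
    // when passed true.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
```

Either call works; waitForCompletion() is the more common choice in drivers because it returns the job's success or failure as a boolean.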
We saw in Chapter 5 that the way Hadoop executes a MapReduce program depends on a couple of configuration settings.
In versions of Hadoop up to and including the 0.20 release series, mapred.job.tracker determines the means of execution. If this configuration property is set to local (the default), the local job runner is used. This runner runs the whole job in a single JVM. It’s designed for testing and for running MapReduce programs on small datasets.
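For example, a test can force the local job runner by setting the property programmatically. This is a sketch under the 0.20-era property names described above; fs.default.name is also set so that input is read from the local filesystem rather than HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical example class showing a local-runner configuration.
public class LocalRunnerExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "local"); // the default: single-JVM local runner
    conf.set("fs.default.name", "file:///"); // read from the local filesystem
    Job job = new Job(conf, "local test");
    // ... set mapper, reducer, input/output paths as usual, then:
    // job.waitForCompletion(true);
  }
}
```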
Alternatively, if mapred.job.tracker is set to a colon-separated host and port pair, then the property is interpreted as a jobtracker address, and the runner submits the job to the jobtracker at that address. The whole process is described in detail in the next section.
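In cluster deployments this property typically lives in mapred-site.xml rather than in code. A sketch of such an entry follows; the hostname is a placeholder, and 8021 is the jobtracker RPC port conventionally used in Hadoop documentation:

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml: switch from the local job runner to a jobtracker.
     Replace the placeholder hostname with your jobtracker's address. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```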
In Hadoop 2.0, a new MapReduce implementation was introduced. The new implementation (called MapReduce 2) is built on a system called YARN, described in YARN ...