Chapter 6. How MapReduce Works
In this chapter, we look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object. (Note that you can also call waitForCompletion(), which submits the job if it hasn't been submitted already, then waits for it to finish.)[51] This method call conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.
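As a concrete illustration, the two submission styles above might look like this in a driver class. This is only a sketch: the class name, job name, and the commented-out configuration calls are placeholders, not code from this chapter, and it assumes a Hadoop cluster (or the local job runner) is available.

```java
// Sketch of a driver showing the two ways to launch a job described above.
// Class and job names are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "example-job");
    // ... set mapper, reducer, input and output paths here ...

    // Option 1: submit() returns as soon as the job is submitted,
    // leaving it to run asynchronously.
    job.submit();

    // Option 2 (instead of submit()): waitForCompletion() submits the
    // job if it hasn't been submitted already, then blocks until it
    // finishes; passing true prints progress to the console.
    // boolean success = job.waitForCompletion(true);
    // System.exit(success ? 0 : 1);
  }
}
```

In practice a driver uses one style or the other: submit() for fire-and-forget launching, waitForCompletion() for the common case of waiting on the result and exiting with a matching status code.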
We saw in Chapter 5 that the way Hadoop executes a MapReduce program depends on a couple of configuration settings.

In versions of Hadoop up to and including the 0.20 release series, mapred.job.tracker determines the means of execution. If this configuration property is set to local (the default), the local job runner is used, which executes the whole job in a single JVM. It's designed for testing and for running MapReduce programs on small datasets.
Alternatively, if mapred.job.tracker is set to a colon-separated host and port pair, then the property is interpreted as a jobtracker address, and the runner submits the job to the jobtracker at that address. The whole process is described in detail in the next section.
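A cluster submission setup along these lines might be configured as follows. This is a sketch of a mapred-site.xml fragment; the hostname is an assumption for illustration, though 8021 was the conventional jobtracker RPC port in this era of Hadoop.

```xml
<!-- mapred-site.xml fragment: point clients at a jobtracker
     (hostname here is an illustrative assumption) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- host:port pair; "local" (the default) would select the
         local job runner instead -->
    <value>jobtracker-host:8021</value>
  </property>
</configuration>
```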
In Hadoop 2.0, a new MapReduce implementation was introduced. The new implementation (called MapReduce 2) is built on a system called YARN, described in YARN ...