Chapter 6. How MapReduce Works

In this chapter, we look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.

Anatomy of a MapReduce Job Run

You can run a MapReduce job with a single method call: submit() on a Job object (note that you can also call waitForCompletion(), which submits the job if it hasn’t been submitted already, then waits for it to finish).[51] This method call conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.

We saw in Chapter 5 that the way Hadoop executes a MapReduce program depends on a couple of configuration settings.

In versions of Hadoop up to and including the 0.20 release series, mapred.job.tracker determines the means of execution. If this configuration property is set to local (the default), the local job runner is used. This runner runs the whole job in a single JVM. It’s designed for testing and for running MapReduce programs on small datasets.

Alternatively, if mapred.job.tracker is set to a colon-separated host and port pair, then the property is interpreted as a jobtracker address, and the runner submits the job to the jobtracker at that address. The whole process is described in detail in the next section.

In Hadoop 2.0, a new MapReduce implementation was introduced. The new implementation (called MapReduce 2) is built on a system called YARN, described in YARN ...

Get Hadoop: The Definitive Guide, 3rd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.