Chapter 6. R+Hadoop

Of the three Hadoop-related strategies we discuss in this book, this is the most raw: you get to spend time up close and personal with the system. On the one hand, that means you have to understand Hadoop. On the other hand, it gives you the most control. I’ll walk you through Hadoop programming basics and then explain how to use it to run your R code.

If you skipped straight to this chapter but are new to Hadoop, you'll want to review Chapter 5.

Quick Look

Motivation: You need to run the same R code many times over different parameters or inputs. For example, you plan to test an algorithm against a series of historical datasets.

Solution: Use a Hadoop cluster to run your R code.

Good because: Hadoop distributes work across a cluster of machines. Using Hadoop as a driver therefore overcomes both R's single-threaded execution model and the memory limits of any one machine.

How It Works

There are several ways to submit work to a cluster, two of which are relevant to R users: streaming and the Java API.

In streaming, you write your Map and Reduce operations as R scripts. (Well, streaming lets you write Map and Reduce code in pretty much any scripting language, but since this is a book about R, let's pretend that R is all that exists.) The Hadoop framework launches your R scripts at the appropriate times and communicates with them via standard input and standard output.
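To make this concrete, here is a minimal sketch of a streaming job's two halves, written as standalone R scripts. The word-count task and the file names mapper.R and reducer.R are illustrative choices, not anything Hadoop requires; the only real contract is that each script reads from standard input and writes tab-separated key/value pairs to standard output.

    #! /usr/bin/env Rscript
    # mapper.R: read raw lines from stdin, emit one "word<TAB>1"
    # pair per word on stdout.
    input <- file("stdin", open = "r")
    while (length(line <- readLines(input, n = 1, warn = FALSE)) > 0) {
      words <- unlist(strsplit(tolower(line), "[^a-z]+"))
      for (w in words[nchar(words) > 0]) {
        cat(w, "\t1\n", sep = "")
      }
    }
    close(input)

    #! /usr/bin/env Rscript
    # reducer.R: Hadoop sorts the mapper output by key, so all counts
    # for a given word arrive together; sum them and emit one total.
    input <- file("stdin", open = "r")
    current <- NULL
    total <- 0
    while (length(line <- readLines(input, n = 1, warn = FALSE)) > 0) {
      parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
      if (!is.null(current) && parts[1] != current) {
        cat(current, "\t", total, "\n", sep = "")
        total <- 0
      }
      current <- parts[1]
      total <- total + as.numeric(parts[2])
    }
    if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
    close(input)

You would then hand both scripts to the streaming jar when submitting the job, along the lines of: hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R (the jar's exact path varies by Hadoop version and installation).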

By comparison, when using the Java API, your Map and Reduce operations are written in Java. Your Java code, in turn, invokes ...
