Parallel R

by Q. Ethan McCallum, Stephen Weston
October 2011
Intermediate to advanced
126 pages
3h 10m
English
O'Reilly Media, Inc.

Chapter 6. R+Hadoop

Of the three Hadoop-related strategies we discuss in this book, this is the most raw: you get to spend time up close and personal with the system. On the one hand, that means you have to understand Hadoop. On the other hand, it gives you the most control. I’ll walk you through Hadoop programming basics and then explain how to use it to run your R code.

If you skipped straight to this chapter, but you’re new to Hadoop, you’ll want to review Chapter 5.

Quick Look

Motivation: You need to run the same R code many times, each time with different parameters or inputs. For example, you plan to test an algorithm against a series of historical data sets.

Solution: Use a Hadoop cluster to run your R code.

Good because: Hadoop distributes work across a cluster of machines. As such, using Hadoop as a driver overcomes R’s single-threaded limitation as well as its memory boundaries.

How It Works

There are several ways to submit work to a cluster, two of which are relevant to R users: streaming and the Java API.

In streaming, you write your Map and Reduce operations as R scripts. (Well, streaming lets you write Map and Reduce code in pretty much any scripting language; but since this is a book about R, let’s pretend that R is all that exists.) The Hadoop framework launches your R scripts at the appropriate times and communicates with them via standard input and standard output.
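To make the standard input/standard output contract concrete, here is a minimal sketch of a word count written as a pair of R functions. It is an illustration, not code from the book: the names `map_lines` and `reduce_lines` are hypothetical, and the sketch assumes the streaming defaults of one record per line, a tab between key and value, and reducer input that arrives sorted by key. The last few lines simulate the framework locally (map, sort, reduce) so that the example runs without a cluster.

```r
# Mapper logic: turn each line of text into "word<TAB>1" records.
# In an actual streaming script you would feed this from
# readLines(file("stdin")) and write the result with cat().
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer logic: add up the counts for each key.
# Assumes every input line has the form "key<TAB>count".
reduce_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  parts  <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.numeric(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)    # one total per distinct key
  paste(names(totals), totals, sep = "\t")
}

# Local stand-in for the framework: map, sort by key, then reduce.
input    <- c("the quick brown fox", "the lazy dog")
mapped   <- map_lines(input)
shuffled <- sort(mapped)
result   <- reduce_lines(shuffled)
cat(result, sep = "\n")
```

Keeping the logic in plain functions over character vectors means you can test it on a laptop before involving a cluster; the thin wrapper that reads from standard input and writes to standard output is the only streaming-specific part.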

By comparison, when using the Java API, your Map and Reduce operations are written in Java. Your Java code, in turn, invokes ...


Publisher Resources

ISBN: 9781449317850