Skip to Content
Parallel R
book

Parallel R

by Q. Ethan McCallum, Stephen Weston
October 2011
Intermediate to advanced
126 pages
3h 10m
English
O'Reilly Media, Inc.
Content preview from Parallel R

Chapter 5. A Primer on MapReduce and Hadoop

Hadoop is an open-source framework for large-scale data storage and distributed computing, built on the MapReduce model. Doug Cutting initially created Hadoop as a component of the Nutch web crawler. It became its own project in 2006, and graduated to a top-level Apache project in 2008. During this time, Hadoop has experienced widespread adoption.

One of Hadoop’s strengths is that it is a general framework, applicable to a variety of domains and programming languages. One use case, and the common thread of the book’s remaining chapters, is to drive large R jobs.

This chapter explains some basics of MapReduce and Hadoop. It may feel a little out of place, as it’s not specific to R; but the content is too important to hide in an appendix.

Have no fear: I don’t dive into deep details here. There is a lot more to MapReduce and Hadoop than I could possibly cover in this book, let alone a chapter. I’ll provide just enough guidance to set you on your way. For a more thorough exploration I encourage you to read the Google MapReduce paper mentioned in , as well as Hadoop: The Definitive Guide by Tom White (O’Reilly).

If you already have a grasp on MapReduce and Hadoop, feel free to skip to the next chapter.

Hadoop at Cruising Altitude

When people think “Apache Hadoop,”[43] they often think about churning through terabytes of input across clusters made of tens or hundreds of machines, or nodes. Logfile processing is such an oft-cited use case, in fact, ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Advanced R

Advanced R

Hadley Wickham
Learning R

Learning R

Richard Cotton
Mastering Spark with R

Mastering Spark with R

Javier Luraschi, Kevin Kuo, Edgar Ruiz

Publisher Resources

ISBN: 9781449317850Supplemental ContentErrata