This chapter sets the pace for the rest of the book. If you’re in a hurry, feel free to skip to the chapter you need. (The section In a Hurry? has a quick-ref look at the various strategies and where they fit. That should help you pick a starting point.) Just make sure you come back here to understand our choice of vocabulary, how we chose what to cover, and so on.
It’s tough to argue with R. Who could dislike a high-quality, cross-platform, open-source statistical software product? It has an interactive console for exploratory work. It can run as a scripting language to repeat a process you’ve captured. It has a lot of statistical calculations built-in so you don’t have to reinvent the wheel. Did we mention that R is free?
When the base toolset isn’t enough, R users have access to a rich ecosystem of add-on packages and a gaggle of GUIs to make their lives even easier. No wonder R has become a favorite in the age of Big Data.
Since R is perfect, then, we can end this book. Right?
Not quite. It’s precisely the Big Data age that has exposed R’s blemishes.
These imperfections stem not from defects in the software itself, but from the passage of time: quite simply, R was not built in anticipation of the Big Data revolution.
R was born in 1995. Disk space was expensive, RAM even more so, and this thing called The Internet was just getting its legs. Notions of “large-scale data analysis” and “high-performance computing” were reasonably rare. Outside of Wall Street firms and university research labs, there just wasn’t that much data to crunch.
Fast-forward to the present day, and hardware costs a fraction of what it once did. Computing power is available online for pennies. Everyone is suddenly interested in collecting and analyzing data, and the necessary resources are well within reach.
This surge in data analysis has brought two of R’s limitations to the forefront: it’s single-threaded and memory-bound. Allow us to explain:
- It’s single-threaded
The R language has no explicit constructs for parallelism, such as threads or mutexes. An out-of-the-box R install cannot take advantage of multiple CPUs.
- It’s memory-bound
R requires that your entire dataset[1] fit in memory (RAM).[2] Four gigabytes of RAM will not hold eight gigabytes of data, no matter how much you smile when you ask.
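To make the single-threaded limitation concrete, here's a quick demonstration you can run in any R session: four half-second tasks dispatched through plain lapply() always take about two seconds, because base R evaluates them strictly one at a time, no matter how many cores your machine has.

```r
# Base R evaluates these four half-second tasks on a single
# core, one after another -- there is no built-in way to fan
# them out across CPUs.
elapsed <- system.time(
  lapply(1:4, function(i) Sys.sleep(0.5))
)["elapsed"]
elapsed  # about 2 seconds: 4 tasks x 0.5s, no overlap
```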
While these are certainly inconvenient, they’re hardly insurmountable.
People have created a series of workarounds over the years. Doing a lot of matrix math? You can build R against a multithreaded Basic Linear Algebra Subprograms (BLAS) library. Churning through large datasets? Use a relational database or another manual method to retrieve your data in smaller, more manageable pieces. And so on, and so forth.
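As an illustration of the "smaller, more manageable pieces" idea, the following sketch streams a CSV file through R in fixed-size chunks, aggregating as it goes so the full file never has to sit in RAM at once. (The file and its value column are invented here purely for the example; a real job would point at your own data.)

```r
# Create a small stand-in "large" CSV file for the demo.
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:100), tf, row.names = FALSE)

con <- file(tf, open = "r")
header <- readLines(con, n = 1)
total <- 0
repeat {
  chunk <- readLines(con, n = 25)          # 25 rows at a time
  if (length(chunk) == 0) break
  df <- read.csv(text = c(header, chunk))  # parse just this chunk
  total <- total + sum(df$value)           # aggregate as we go
}
close(con)
total  # 5050: same answer, without holding every row at once
```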
Some big winners involve parallelism. Spreading work across multiple CPUs overcomes R’s single-threaded nature. Offloading work to multiple machines reaps the multi-process benefit and also addresses R’s memory barrier. In this book we’ll cover a few strategies to give R that parallel boost, specifically those which take advantage of modern multicore hardware and cheap distributed computing.
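As a small taste of what's ahead, here's the multicore-style approach applied to a batch of half-second tasks via mclapply(), which ships in the parallel package as of R 2.14.0: with two forked workers, four tasks finish in roughly one second instead of two.

```r
library(parallel)  # bundled with R 2.14.0 and later
# mclapply() forks worker processes, so the four half-second
# tasks below run two at a time instead of one at a time.
# (Forking is unsupported on Windows; there you must use
# mc.cores = 1 or one of the cluster-based approaches from
# later chapters.)
elapsed <- system.time(
  mclapply(1:4, function(i) Sys.sleep(0.5), mc.cores = 2)
)["elapsed"]
elapsed  # roughly 1 second on a machine with 2+ cores
```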
Now that we’ve set the tone for why we’re here, let’s take a look at what we plan to accomplish in the coming pages (or screens if you’re reading this electronically).
Each chapter is a look into one strategy for R parallelism, including:
- What it is
- Where to find it
- How to use it
- Where it works well, and where it doesn’t
First up is the snow package, followed by a tour of the multicore package. We then provide a look at the new parallel package that’s due to arrive in R 2.14. After that, we’ll take a brief side-tour to explain MapReduce and Hadoop. That will serve as a foundation for the remaining chapters: R+Hadoop (Hadoop streaming and the Java API), RHIPE, and segue.
In Chapter 9, we will briefly mention some tools that were too new for us to cover in depth.
There will likely be other tools we hadn’t heard about (or that didn’t exist) at the time of writing.[3] Please let us know about them! You can reach us through this book’s website at http://parallelrbook.com/.
This is a book about R, yes, but we’ll expect you know the basics of how to get around. If you’re new to R or need a refresher course, please flip through Paul Teetor’s R Cookbook (O’Reilly), Robert Kabacoff’s R in Action (Manning), or another introductory title. You should take particular note of the lapply() function, which plays an important role in this book.
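If lapply() is new to you, here's the pattern in miniature: it applies a function to each element of a vector or list and returns a list of results. The parallel packages in this book all provide near drop-in replacements for it.

```r
# Apply a function to each element; get back a list of results.
# This serial pattern is exactly what snow, multicore, and
# parallel generalize across CPUs and machines.
squares <- lapply(1:5, function(x) x^2)
unlist(squares)  # 1 4 9 16 25
```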
Some of the topics require several machines’ worth of infrastructure, in which case you’ll need access to a talented sysadmin. You’ll also need hardware, which you can buy and maintain yourself, or rent from a hosting provider. Cloud services, notably Amazon Web Services (AWS),[4] have become a popular choice in this arena. AWS has plenty of documentation, and you can also read Programming Amazon EC2, by Jurg van Vliet and Flavia Paganelli (O’Reilly) as a supplement.
(Please note that using a provider still requires a degree of sysadmin knowledge. If you’re not up to the task, you’ll want to find and bribe your skilled sysadmin friends.)
If you’re in a hurry, you can skip straight to the chapter you need. The list below is a quick look at the various strategies.
snow
Overview: Good for use on traditional clusters, especially if MPI is available. It supports MPI, PVM, nws, and sockets for communication, and is quite portable, running on Linux, Mac OS X, and Windows.
Solves: Single-threaded, memory-bound.
Pros: Mature, popular package; leverages MPI’s speed without its complexity.
Cons: Can be difficult to configure.
multicore
Overview: Good for big-CPU problems when setting up a Hadoop cluster is too much of a hassle. Lets you parallelize your R code without ever leaving the R interpreter.
Solves: Single-threaded.
Pros: Simple and efficient; easy to install; no configuration needed.
Cons: Can only use one machine; doesn’t support Windows; no built-in support for parallel random number generation (RNG).
parallel
Overview: A merger of snow and multicore that comes built into R as of R 2.14.0.
Solves: Single-threaded, memory-bound.
Pros: No installation necessary; has great support for parallel random number generation.
Cons: Can only use one machine on Windows; can be difficult to configure on multiple Linux machines.
R+Hadoop
Overview: Run your R code on a Hadoop cluster.
Solves: Single-threaded, memory-bound.
Pros: You get Hadoop’s scalability.
Cons: Requires a Hadoop cluster (internal or cloud-based); breaks up a single logical process into multiple scripts and steps (can be a hassle for exploratory work).
RHIPE
Overview: Talk Hadoop without ever leaving the R interpreter.
Solves: Single-threaded, memory-bound.
Pros: Closer to a native R experience than R+Hadoop; use pure R code for your MapReduce operations.
Cons: Requires a Hadoop cluster; requires extra setup on the cluster; cannot process standard SequenceFiles (for binary data).
Welcome to the beginning of your journey into parallel R. Our first stop is a look at the popular snow package.
[1] We emphasize “dataset” here, not necessarily “algorithms.”
[2] It’s a big problem. Because R will often make multiple copies of the same data structure for no apparent reason, you often need three times as much memory as the size of your dataset. And if you don’t have enough memory, you die a slow death as your poor machine swaps and thrashes. Some people turn off virtual memory with the swapoff command so they can die quickly.
[3] Try as we might, our massive Monte Carlo simulations have brought us no closer to predicting the next R parallelism strategy. Nor any winning lottery numbers, for that matter.