Other Packages for Parallel Computation with R
Segue
The segue package by JD
Long is a great choice for running simple parallel programs; it’s
intended to be a gentle introduction to parallel computation. Segue runs
programs in the cloud using AWS’s Elastic MapReduce service. (This is a
distinct product from EC2, which I used to install my own private Hadoop
cluster.) It borrows some Hadoop infrastructure, but it isn’t a full
map/reduce package. Segue is modeled
on the lapply function in R; you use
it to apply a function to a data set across a set of computers in the
cloud. Let’s show how it works.
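Before moving to the cloud, it may help to see the local pattern that this style of computation mirrors. The following is a minimal sketch in base R only; it does not use segue at all. The function simulate_mean and the inputs list are hypothetical names chosen for illustration.

```r
# Local analog of the pattern: apply one function to each element of a list.
# simulate_mean is a hypothetical example function, not part of segue.
simulate_mean <- function(n) {
  # Mean of the integers 1..n; deterministic, so the result is easy to check
  mean(seq_len(n))
}

# Each list element is one independent unit of work
inputs <- list(10, 100, 1000)

# lapply runs the function on each element in turn, on this machine only
results <- lapply(inputs, simulate_mean)

print(unlist(results))
```

Because each element is processed independently of the others, this kind of computation is a natural candidate for spreading across several machines.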
The segue package is hosted on
Google Code, not CRAN. To install it, you can use the install_url function in the devtools package:
> library(devtools)
> # At the time I wrote this book, the current version was 0.05;
> # make sure to change the link to get the latest version:
> install_url("http://segue.googlecode.com/files/segue_0.05.tar.gz")
You’ll need an Amazon Web Services account to use it.
Warning
You will be billed by the hour for using AWS. Make sure that you understand how you will be charged and how to use AWS before you start.
You’ll need to get your Access Key ID and Secret Access Key from AWS’s Security Credentials page.
> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run the setCredentials() function.
> # set aws.access.id to your amazon access id, aws.secret.key to ...