Parallel R

by Q. Ethan McCallum, Stephen Weston

October 2011

Intermediate to advanced

126 pages

3h 10m

English

O'Reilly Media, Inc.

Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments; Q. Ethan McCallum; Stephen Weston
Why R?; Why Not R?; The Solution: Parallel Execution; A Road Map for This Book; What We’ll Cover; Looking Forward…; What We’ll Assume You Already Know; In a Hurry?; snow; multicore; parallel; R+Hadoop; RHIPE; Segue; Summary
Quick Look; How It Works; Setting Up; Working with It; Creating Clusters with makeCluster; Parallel K-Means; Initializing Workers; Load Balancing with clusterApplyLB; Task Chunking with parLapply; Vectorizing with clusterSplit; Load Balancing Redux; Functions and Environments; Random Number Generation; snow Configuration; Installing Rmpi; Executing snow Programs on a Cluster with Rmpi; Executing snow Programs with a Batch Queueing System; Troubleshooting snow Programs; When It Works…; …And When It Doesn’t; The Wrap-up
Quick Look; How It Works; Setting Up; Working with It; The mclapply Function; The mc.cores Option; The mc.set.seed Option; Load Balancing with mclapply; The pvec Function; The parallel and collect Functions; Using collect Options; Parallel Random Number Generation; The Low-Level API; When It Works…; …And When It Doesn’t; The Wrap-up
Quick Look; How It Works; Setting Up; Working with It; Getting Started; Creating Clusters with makeCluster; Parallel Random Number Generation; Summary of Differences; When It Works…; …And When It Doesn’t; The Wrap-up
Hadoop at Cruising Altitude; A MapReduce Primer; Thinking in MapReduce: Some Pseudocode Examples; Calculate Average Call Length for Each Date; Number of Calls by Each User, on Each Date; Run a Special Algorithm on Each Record; Binary and Whole-File Data: SequenceFiles; No Cluster? No Problem! Look to the Clouds…; The Wrap-up
Quick Look; How It Works; Setting Up; Working with It; Simple Hadoop Streaming (All Text); Streaming, Redux: Indirectly Working with Binary Data; The Java API: Binary Input and Output; Processing Related Groups (the Full Map and Reduce Phases); When It Works…; …And When It Doesn’t; The Wrap-up
Quick Look; How It Works; Setting Up; Working with It; Phone Call Records, Redux; Tweet Brevity; More Complex Tweet Analysis; When It Works…; …And When It Doesn’t; The Wrap-up
Quick Look; How It Works; Setting Up; Working with It; Model Testing: Parameter Sweep; When It Works…; …And When It Doesn’t; The Wrap-up
doRedis; RevoScale R and RevoConnectR (RHadoop); cloudNumbers.com

Content preview from Parallel R

Chapter 5. A Primer on MapReduce and Hadoop

Hadoop is an open-source framework for large-scale data storage and distributed computing, built on the MapReduce model. Doug Cutting initially created Hadoop as a component of the Nutch web crawler. It became its own project in 2006, and graduated to a top-level Apache project in 2008. Since then, Hadoop has seen widespread adoption.

One of Hadoop’s strengths is that it is a general framework, applicable to a variety of domains and programming languages. One use case, and the common thread of the book’s remaining chapters, is to drive large R jobs.

This chapter explains some basics of MapReduce and Hadoop. It may feel a little out of place, as it’s not specific to R; but the content is too important to hide in an appendix.

Have no fear: I don’t dive into deep details here. There is a lot more to MapReduce and Hadoop than I could possibly cover in this book, let alone a chapter. I’ll provide just enough guidance to set you on your way. For a more thorough exploration I encourage you to read the Google MapReduce paper, as well as Hadoop: The Definitive Guide by Tom White (O’Reilly).

If you already have a grasp on MapReduce and Hadoop, feel free to skip to the next chapter.

Hadoop at Cruising Altitude

When people think “Apache Hadoop,”^[43] they often think about churning through terabytes of input across clusters made of tens or hundreds of machines, or nodes. Logfile processing is such an oft-cited use case, in fact, ...