Foreword
Apache Spark is a distributed computing platform built on extensibility: Spark’s APIs make it easy to combine input from many data sources and process it using diverse programming languages and algorithms to build a data application. R is one of the most powerful languages for data science and statistics, so it makes a lot of sense to connect R to Spark. Fortunately, R’s rich language features enable simple APIs for calling Spark from R that look similar to running R on local data sources. With a bit of background about both systems, you will be able to invoke massive computations in Spark or run your R code in parallel from the comfort of your favorite R programming environment.
This book explores using Spark from R in detail, focusing on the sparklyr package, which enables support for dplyr and other packages familiar to the R community. It covers all of the main use cases, ranging from querying data with the Spark engine to exploratory data analysis, machine learning, parallel execution of R code, and streaming. It also includes a self-contained introduction to running Spark and monitoring job execution. The authors are exactly the right people to write about this topic: Javier, Kevin, and Edgar have been involved in sparklyr development since the project started. I was excited to see how well they've assembled this clear and focused guide to using Spark with R.
I hope that you enjoy this book and use it to scale up your R workloads and connect them to the capabilities ...