Mastering Spark with R

by Javier Luraschi, Kevin Kuo, Edgar Ruiz

October 2019

Beginner to intermediate

293 pages

6h 55m

English

O'Reilly Media, Inc.

Chapter 1. Introduction

You know nothing, Jon Snow.

—Ygritte

With information growing at exponential rates, it’s no surprise that historians are referring to this period of history as the Information Age. The increasing speed at which data is being collected has created new opportunities and is certainly poised to create even more. This chapter presents the tools that have been used to solve large-scale data challenges. First, it introduces Apache Spark as a leading tool that is democratizing our ability to process large datasets. With this as a backdrop, we introduce the R computing language, which was specifically designed to simplify data analysis. Finally, this leads us to introduce sparklyr, a project merging R and Spark into a powerful tool that is easily accessible to all.

Chapter 2, Getting Started, presents the prerequisites, tools, and steps you need to get Spark and R working on your personal computer. You will learn how to install and initialize Spark, get introduced to common operations, and complete your very first data processing and modeling task. The goal of that chapter is to help anyone grasp the concepts and tools required to start tackling large-scale data challenges that, until recently, were accessible to just a few organizations.

You then move on to learning how to analyze large-scale data, followed by building models capable of predicting trends and discovering information hidden in vast amounts of data. At which point, you will have ...