Learning Spark

by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia

February 2015

Intermediate to advanced

276 pages

7h 18m

English

O'Reilly Media, Inc.

Foreword
Preface
Audience; How This Book Is Organized; Supporting Books; Conventions Used in This Book; Code Examples; Safari® Books Online; How to Contact Us; Content Updates; May 8, 2015; Acknowledgments
1. Introduction to Data Analysis with Spark
What Is Apache Spark?; A Unified Stack; Spark Core; Spark SQL; Spark Streaming; MLlib; GraphX; Cluster Managers; Who Uses Spark, and for What?; Data Science Tasks; Data Processing Applications; A Brief History of Spark; Spark Versions and Releases; Storage Layers for Spark
2. Downloading Spark and Getting Started
Downloading Spark; Introduction to Spark’s Python and Scala Shells; Introduction to Core Spark Concepts; Standalone Applications; Initializing a SparkContext; Building Standalone Applications; Conclusion
3. Programming with RDDs
RDD Basics; Creating RDDs; RDD Operations; Transformations; Actions; Lazy Evaluation; Passing Functions to Spark; Python; Scala; Java; Common Transformations and Actions; Basic RDDs; Converting Between RDD Types; Persistence (Caching); Conclusion
4. Working with Key/Value Pairs
Motivation; Creating Pair RDDs; Transformations on Pair RDDs; Aggregations; Grouping Data; Joins; Sorting Data; Actions Available on Pair RDDs; Data Partitioning (Advanced); Determining an RDD’s Partitioner; Operations That Benefit from Partitioning; Operations That Affect Partitioning; Example: PageRank; Custom Partitioners; Conclusion
5. Loading and Saving Your Data
Motivation; File Formats; Text Files; JSON; Comma-Separated Values and Tab-Separated Values; SequenceFiles; Object Files; Hadoop Input and Output Formats; File Compression; Filesystems; Local/“Regular” FS; Amazon S3; HDFS; Structured Data with Spark SQL; Apache Hive; JSON; Databases; Java Database Connectivity; Cassandra; HBase; Elasticsearch; Conclusion
6. Advanced Spark Programming
Introduction; Accumulators; Accumulators and Fault Tolerance; Custom Accumulators; Broadcast Variables; Optimizing Broadcasts; Working on a Per-Partition Basis; Piping to External Programs; Numeric RDD Operations; Conclusion
7. Running on a Cluster
Introduction; Spark Runtime Architecture; The Driver; Executors; Cluster Manager; Launching a Program; Summary; Deploying Applications with spark-submit; Packaging Your Code and Dependencies; A Java Spark Application Built with Maven; A Scala Spark Application Built with sbt; Dependency Conflicts; Scheduling Within and Between Spark Applications; Cluster Managers; Standalone Cluster Manager; Hadoop YARN; Apache Mesos; Amazon EC2; Which Cluster Manager to Use?; Conclusion
8. Tuning and Debugging Spark
Configuring Spark with SparkConf; Components of Execution: Jobs, Tasks, and Stages; Finding Information; Spark Web UI; Driver and Executor Logs; Key Performance Considerations; Level of Parallelism; Serialization Format; Memory Management; Hardware Provisioning; Conclusion

9. Spark SQL
Linking with Spark SQL; Using Spark SQL in Applications; Initializing Spark SQL; Basic Query Example; DataFrames; Caching; Loading and Saving Data; Apache Hive; Data Sources/Parquet; JSON; From RDDs; JDBC/ODBC Server; Working with Beeline; Long-Lived Tables and Queries; User-Defined Functions; Spark SQL UDFs; Hive UDFs; Spark SQL Performance; Performance Tuning Options; Conclusion
10. Spark Streaming
A Simple Example; Architecture and Abstraction; Transformations; Stateless Transformations; Stateful Transformations; Output Operations; Input Sources; Core Sources; Additional Sources; Multiple Sources and Cluster Sizing; 24/7 Operation; Checkpointing; Driver Fault Tolerance; Worker Fault Tolerance; Receiver Fault Tolerance; Processing Guarantees; Streaming UI; Performance Considerations; Batch and Window Sizes; Level of Parallelism; Garbage Collection and Memory Usage; Conclusion
11. Machine Learning with MLlib
Overview; System Requirements; Machine Learning Basics; Example: Spam Classification; Data Types; Working with Vectors; Algorithms; Feature Extraction; Statistics; Classification and Regression; Clustering; Collaborative Filtering and Recommendation; Dimensionality Reduction; Model Evaluation; Tips and Performance Considerations; Preparing Features; Configuring Algorithms; Caching RDDs to Reuse; Recognizing Sparsity; Level of Parallelism; Pipeline API; Conclusion
Index

Overview

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Publisher Resources

ISBN: 9781449359034