High Performance Spark

by Holden Karau, Rachel Warren

May 2017

Intermediate to advanced

358 pages

10h 4m

English

O'Reilly Media, Inc.

Preface
First Edition Notes; Supporting Books and Materials; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact the Authors; How to Contact Us; Acknowledgments
1. Introduction to High Performance Spark
What Is Spark and Why Performance Matters; What You Can Expect to Get from This Book; Spark Versions; Why Scala?; To Be a Spark Expert You Have to Learn a Little Scala Anyway; The Spark Scala API Is Easier to Use Than the Java API; Scala Is More Performant Than Python; Why Not Scala?; Learning Scala; Conclusion
2. How Spark Works
How Spark Fits into the Big Data Ecosystem; Spark Components; Spark Model of Parallel Computing: RDDs; Lazy Evaluation; In-Memory Persistence and Memory Management; Immutability and the RDD Interface; Types of RDDs; Functions on RDDs: Transformations Versus Actions; Wide Versus Narrow Dependencies; Spark Job Scheduling; Resource Allocation Across Applications; The Spark Application; The Anatomy of a Spark Job; The DAG; Jobs; Stages; Tasks; Conclusion
3. DataFrames, Datasets, and Spark SQL
Getting Started with the SparkSession (or HiveContext or SQLContext); Spark SQL Dependencies; Managing Spark Dependencies; Avoiding Hive JARs; Basics of Schemas; DataFrame API; Transformations; Multi-DataFrame Transformations; Plain Old SQL Queries and Interacting with Hive Data; Data Representation in DataFrames and Datasets; Tungsten; Data Loading and Saving Functions; DataFrameWriter and DataFrameReader; Formats; Save Modes; Partitions (Discovery and Writing); Datasets; Interoperability with RDDs, DataFrames, and Local Collections; Compile-Time Strong Typing; Easier Functional (RDD “like”) Transformations; Relational Transformations; Multi-Dataset Relational Transformations; Grouped Operations on Datasets; Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs); Query Optimizer; Logical and Physical Plans; Code Generation; Large Query Plans and Iterative Algorithms; Debugging Spark SQL Queries; JDBC/ODBC Server; Conclusion
4. Joins (SQL and Core)
Core Spark Joins; Choosing a Join Type; Choosing an Execution Plan; Spark SQL Joins; DataFrame Joins; Dataset Joins; Conclusion
5. Effective Transformations
Narrow Versus Wide Transformations; Implications for Performance; Implications for Fault Tolerance; The Special Case of coalesce; What Type of RDD Does Your Transformation Return?; Minimizing Object Creation; Reusing Existing Objects; Using Smaller Data Structures; Iterator-to-Iterator Transformations with mapPartitions; What Is an Iterator-to-Iterator Transformation?; Space and Time Advantages; An Example; Set Operations; Reducing Setup Overhead; Shared Variables; Broadcast Variables; Accumulators; Reusing RDDs; Cases for Reuse; Deciding if Recompute Is Inexpensive Enough; Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files; Alluxio (nee Tachyon); LRU Caching; Noisy Cluster Considerations; Interaction with Accumulators; Conclusion
6. Working with Key/Value Data
The Goldilocks Example; Goldilocks Version 0: Iterative Solution; How to Use PairRDDFunctions and OrderedRDDFunctions; Actions on Key/Value Pairs; What’s So Dangerous About the groupByKey Function; Goldilocks Version 1: groupByKey Solution; Choosing an Aggregation Operation; Dictionary of Aggregation Operations with Performance Considerations; Multiple RDD Operations; Co-Grouping; Partitioners and Key/Value Data; Using the Spark Partitioner Object; Hash Partitioning; Range Partitioning; Custom Partitioning; Preserving Partitioning Information Across Transformations; Leveraging Co-Located and Co-Partitioned RDDs; Dictionary of Mapping and Partitioning Functions PairRDDFunctions; Dictionary of OrderedRDDOperations; Sorting by Two Keys with SortByKey; Secondary Sort and repartitionAndSortWithinPartitions; Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function; How Not to Sort by Two Orderings; Goldilocks Version 2: Secondary Sort; A Different Approach to Goldilocks; Goldilocks Version 3: Sort on Cell Values; Straggler Detection and Unbalanced Data; Back to Goldilocks (Again); Goldilocks Version 4: Reduce to Distinct on Each Partition; Conclusion
7. Going Beyond Scala
Beyond Scala within the JVM; Beyond Scala, and Beyond the JVM; How PySpark Works; How SparkR Works; Spark.jl (Julia Spark); How Eclair JS Works; Spark on the Common Language Runtime (CLR)—C# and Friends; Calling Other Languages from Spark; Using Pipe and Friends; JNI; Java Native Access (JNA); Underneath Everything Is FORTRAN; Getting to the GPU; The Future; Conclusion
8. Testing and Validation
Unit Testing; General Spark Unit Testing; Mocking RDDs; Getting Test Data; Generating Large Datasets; Sampling; Property Checking with ScalaCheck; Computing RDD Difference; Integration Testing; Choosing Your Integration Testing Environment; Verifying Performance; Spark Counters for Verifying Performance; Projects for Verifying Performance; Job Validation; Conclusion
9. Spark MLlib and ML
Choosing Between Spark MLlib and Spark ML; Working with MLlib; Getting Started with MLlib (Organization and Imports); MLlib Feature Encoding and Data Preparation; Feature Scaling and Selection; MLlib Model Training; Predicting; Serving and Persistence; Model Evaluation; Working with Spark ML; Spark ML Organization and Imports; Pipeline Stages; Explain Params; Data Encoding; Data Cleaning; Spark ML Models; Putting It All Together in a Pipeline; Training a Pipeline; Accessing Individual Stages; Data Persistence and Spark ML; Extending Spark ML Pipelines with Your Own Algorithms; Model and Pipeline Persistence and Serving with Spark ML; General Serving Considerations; Conclusion

10. Spark Components and Packages
Stream Processing with Spark; Sources and Sinks; Batch Intervals; Data Checkpoint Intervals; Considerations for DStreams; Considerations for Structured Streaming; High Availability Mode (or Handling Driver Failure or Checkpointing); GraphX; Using Community Packages and Libraries; Creating a Spark Package; Conclusion
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
Spark Tuning and Cluster Sizing; How to Adjust Spark Settings; How to Determine the Relevant Information About Your Cluster; Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?; Calculating Executor and Driver Memory Overhead; How Large to Make the Spark Driver; A Few Large Executors or Many Small Executors?; Allocating Cluster Resources and Dynamic Allocation; Dividing the Space Within One Executor; Number and Size of Partitions; Serialization Options; Kryo; Some Additional Debugging Techniques
Index

Content preview from High Performance Spark

Preface

We wrote this book for data engineers and data scientists who are looking to get the most out of Spark. If you’ve been working with Spark and have invested in it, but your experience so far has been marred by memory errors and mysterious, intermittent failures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature, see “Supporting Books and Materials”.

We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, which is usually more intuitive to the data scientist. Thus this book may be more useful to a data engineer who is less experienced with thinking critically about the statistical nature, distribution, and layout of data when considering performance. We hope that this book will ...

Publisher Resources

ISBN: 9781491943199
