High Performance Spark

by Holden Karau, Rachel Warren

May 2017

Intermediate to advanced

358 pages

10h 4m

English

O'Reilly Media, Inc.

Preface
First Edition Notes; Supporting Books and Materials; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact the Authors; How to Contact Us; Acknowledgments
1. Introduction to High Performance Spark
What Is Spark and Why Performance Matters; What You Can Expect to Get from This Book; Spark Versions; Why Scala?; To Be a Spark Expert You Have to Learn a Little Scala Anyway; The Spark Scala API Is Easier to Use Than the Java API; Scala Is More Performant Than Python; Why Not Scala?; Learning Scala; Conclusion
2. How Spark Works
How Spark Fits into the Big Data Ecosystem; Spark Components; Spark Model of Parallel Computing: RDDs; Lazy Evaluation; In-Memory Persistence and Memory Management; Immutability and the RDD Interface; Types of RDDs; Functions on RDDs: Transformations Versus Actions; Wide Versus Narrow Dependencies; Spark Job Scheduling; Resource Allocation Across Applications; The Spark Application; The Anatomy of a Spark Job; The DAG; Jobs; Stages; Tasks; Conclusion
3. DataFrames, Datasets, and Spark SQL
Getting Started with the SparkSession (or HiveContext or SQLContext); Spark SQL Dependencies; Managing Spark Dependencies; Avoiding Hive JARs; Basics of Schemas; DataFrame API; Transformations; Multi-DataFrame Transformations; Plain Old SQL Queries and Interacting with Hive Data; Data Representation in DataFrames and Datasets; Tungsten; Data Loading and Saving Functions; DataFrameWriter and DataFrameReader; Formats; Save Modes; Partitions (Discovery and Writing); Datasets; Interoperability with RDDs, DataFrames, and Local Collections; Compile-Time Strong Typing; Easier Functional (RDD “like”) Transformations; Relational Transformations; Multi-Dataset Relational Transformations; Grouped Operations on Datasets; Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs); Query Optimizer; Logical and Physical Plans; Code Generation; Large Query Plans and Iterative Algorithms; Debugging Spark SQL Queries; JDBC/ODBC Server; Conclusion
4. Joins (SQL and Core)
Core Spark Joins; Choosing a Join Type; Choosing an Execution Plan; Spark SQL Joins; DataFrame Joins; Dataset Joins; Conclusion
5. Effective Transformations
Narrow Versus Wide Transformations; Implications for Performance; Implications for Fault Tolerance; The Special Case of coalesce; What Type of RDD Does Your Transformation Return?; Minimizing Object Creation; Reusing Existing Objects; Using Smaller Data Structures; Iterator-to-Iterator Transformations with mapPartitions; What Is an Iterator-to-Iterator Transformation?; Space and Time Advantages; An Example; Set Operations; Reducing Setup Overhead; Shared Variables; Broadcast Variables; Accumulators; Reusing RDDs; Cases for Reuse; Deciding if Recompute Is Inexpensive Enough; Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files; Alluxio (nee Tachyon); LRU Caching; Noisy Cluster Considerations; Interaction with Accumulators; Conclusion
6. Working with Key/Value Data
The Goldilocks Example; Goldilocks Version 0: Iterative Solution; How to Use PairRDDFunctions and OrderedRDDFunctions; Actions on Key/Value Pairs; What’s So Dangerous About the groupByKey Function; Goldilocks Version 1: groupByKey Solution; Choosing an Aggregation Operation; Dictionary of Aggregation Operations with Performance Considerations; Multiple RDD Operations; Co-Grouping; Partitioners and Key/Value Data; Using the Spark Partitioner Object; Hash Partitioning; Range Partitioning; Custom Partitioning; Preserving Partitioning Information Across Transformations; Leveraging Co-Located and Co-Partitioned RDDs; Dictionary of Mapping and Partitioning Functions PairRDDFunctions; Dictionary of OrderedRDDOperations; Sorting by Two Keys with SortByKey; Secondary Sort and repartitionAndSortWithinPartitions; Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function; How Not to Sort by Two Orderings; Goldilocks Version 2: Secondary Sort; A Different Approach to Goldilocks; Goldilocks Version 3: Sort on Cell Values; Straggler Detection and Unbalanced Data; Back to Goldilocks (Again); Goldilocks Version 4: Reduce to Distinct on Each Partition; Conclusion
7. Going Beyond Scala
Beyond Scala within the JVM; Beyond Scala, and Beyond the JVM; How PySpark Works; How SparkR Works; Spark.jl (Julia Spark); How Eclair JS Works; Spark on the Common Language Runtime (CLR)—C# and Friends; Calling Other Languages from Spark; Using Pipe and Friends; JNI; Java Native Access (JNA); Underneath Everything Is FORTRAN; Getting to the GPU; The Future; Conclusion
8. Testing and Validation
Unit Testing; General Spark Unit Testing; Mocking RDDs; Getting Test Data; Generating Large Datasets; Sampling; Property Checking with ScalaCheck; Computing RDD Difference; Integration Testing; Choosing Your Integration Testing Environment; Verifying Performance; Spark Counters for Verifying Performance; Projects for Verifying Performance; Job Validation; Conclusion
9. Spark MLlib and ML
Choosing Between Spark MLlib and Spark ML; Working with MLlib; Getting Started with MLlib (Organization and Imports); MLlib Feature Encoding and Data Preparation; Feature Scaling and Selection; MLlib Model Training; Predicting; Serving and Persistence; Model Evaluation; Working with Spark ML; Spark ML Organization and Imports; Pipeline Stages; Explain Params; Data Encoding; Data Cleaning; Spark ML Models; Putting It All Together in a Pipeline; Training a Pipeline; Accessing Individual Stages; Data Persistence and Spark ML; Extending Spark ML Pipelines with Your Own Algorithms; Model and Pipeline Persistence and Serving with Spark ML; General Serving Considerations; Conclusion

10. Spark Components and Packages
Stream Processing with Spark; Sources and Sinks; Batch Intervals; Data Checkpoint Intervals; Considerations for DStreams; Considerations for Structured Streaming; High Availability Mode (or Handling Driver Failure or Checkpointing); GraphX; Using Community Packages and Libraries; Creating a Spark Package; Conclusion
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
Spark Tuning and Cluster Sizing; How to Adjust Spark Settings; How to Determine the Relevant Information About Your Cluster; Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?; Calculating Executor and Driver Memory Overhead; How Large to Make the Spark Driver; A Few Large Executors or Many Small Executors?; Allocating Cluster Resources and Dynamic Allocation; Dividing the Space Within One Executor; Number and Size of Partitions; Serialization Options; Kryo; Some Additional Debugging Techniques
Index

Content preview from High Performance Spark

Chapter 1. Introduction to High Performance Spark

This chapter provides an overview of what we hope you will be able to learn from this book and does its best to convince you to learn Scala. Feel free to skip ahead to Chapter 2 if you already know what you’re looking for and use Scala (or have your heart set on another language).

What Is Spark and Why Performance Matters

Apache Spark is a high-performance, general-purpose distributed computing system that has become the most active Apache open source project, with more than 1,000 active contributors.¹ Spark enables us to process large quantities of data, beyond what can fit on a single machine, with a high-level, relatively easy-to-use API. Spark’s design and interface are unique, and it is one of the fastest systems of its kind. Uniquely, Spark allows us to write the logic of data transformations and machine learning algorithms in a way that is parallelizable, but relatively system agnostic. So it is often possible to write computations that are fast for distributed storage systems of varying kind and size.

However, despite its many advantages and the excitement around Spark, the simplest implementation of many common data science routines in Spark can be much slower and much less robust than the best version. Since the computations we are concerned with may involve data at a very large scale, the gains in time and resources from tuning code for performance are enormous. Performance does not just mean running faster; often at ...

Publisher Resources

ISBN: 9781491943199
