Like any good distributed computing tool, Spark relies heavily on the key/value pair paradigm to define and parallelize operations, particularly wide transformations that require the data to be redistributed between machines.
Anytime we want to perform grouped operations in parallel or change the ordering of records amongst machines—be it computing an aggregation statistic or merging customer records—the key/value functionality of Spark is useful as it allows us to easily parallelize our work.
Spark has its own PairRDDFunctions class containing operations defined on RDDs of tuples. The PairRDDFunctions class, made available through implicit conversion, contains most of Spark's methods for joins and custom aggregations. The OrderedRDDFunctions class contains the methods for sorting; it is available on RDDs of tuples in which the first element (the key) has an implicit ordering.
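As a quick illustration of how these classes surface through implicit conversion, the following sketch (with made-up data and an illustrative local SparkSession) uses reduceByKey from PairRDDFunctions and sortByKey from OrderedRDDFunctions:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative setup; in a real application the SparkSession
// would typically already exist.
val spark = SparkSession.builder()
  .appName("pair-rdd-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// An RDD of (key, value) tuples. Because the elements are tuples,
// the PairRDDFunctions methods become available via implicit conversion.
val purchases = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))

// reduceByKey (from PairRDDFunctions): sum the values for each key.
val totals = purchases.reduceByKey(_ + _)

// sortByKey (from OrderedRDDFunctions): available here because String
// keys have an implicit Ordering.
val sorted = totals.sortByKey()

sorted.collect().foreach(println) // (apple,5) then (banana,1)
```

Both reduceByKey and sortByKey are wide transformations, so each triggers a shuffle of the data between partitions, which is precisely why these operations deserve the performance attention discussed below.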
Similar operations are available on Datasets, as discussed in "Grouped Operations on Datasets".
Despite their utility, key/value operations can lead to a number of performance issues. In fact, most of Spark's expensive operations fit the key/value pair paradigm: most wide transformations are key/value transformations, and most of these require some fine-tuning and care to be performant. These performance considerations will be the focus of this chapter. We hope to provide not just a guide to using the functions in the PairRDDFunctions and ...