Chapter 10. Spark Components and Packages

Spark has a large number of components designed to work together as an integrated system, and many of them are distributed as part of Spark itself. This differs from the Hadoop ecosystem, where each task is typically handled by a separate project or system. You’ve already seen how to effectively use the Spark Core, SQL, and ML components; this chapter introduces Spark’s streaming components as well as the external/community components, often referred to as packages. Having a largely integrated system gives Spark two advantages: it simplifies both deployment/cluster management and application development, since there are fewer dependencies and systems to keep track of.
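As a minimal sketch of how community packages are typically pulled in, the snippet below resolves a package by its Maven coordinates when the SparkSession is created. The coordinates shown are placeholders rather than a real package, and depending on how the application is launched it may be simpler to pass the same coordinates on the command line with `spark-submit --packages` or `spark-shell --packages`.

```scala
import org.apache.spark.sql.SparkSession

object PackagesExample {
  def main(args: Array[String]): Unit = {
    // Placeholder Maven coordinates (groupId:artifactId:version); substitute
    // the actual coordinates of the package you need from the Spark Packages
    // index or Maven Central.
    val packageCoordinates = "com.example:spark-example-connector_2.12:1.0.0"

    // spark.jars.packages asks Spark to resolve the package (and its
    // transitive dependencies) and add it to the driver and executor
    // classpaths. The same coordinates can instead be supplied via the
    // --packages flag at launch time.
    val spark = SparkSession.builder()
      .appName("packages-example")
      .config("spark.jars.packages", packageCoordinates)
      .getOrCreate()

    // ... use the package's data source, connector, or library here ...

    spark.stop()
  }
}
```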

Even early versions of Spark provided tools that traditionally would have required the coordination of multiple systems, as illustrated in Figure 10-1.

Figure 10-1. Spark components diagram

As Datasets and the Spark SQL engine have become a building block for other components inside of Spark, Figure 10-2 illustrates a minor reorganization that represents a more up-to-date view, including two of Spark’s newest components, Spark ML and Structured Streaming. Much of your knowledge from working with core Spark and Spark SQL carries over to these other components, although each one has some unique considerations.

Figure 10-2. Spark 2.0+ revised components diagram
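To give a feel for how directly Dataset knowledge transfers to Structured Streaming, here is a minimal word-count sketch (the host, port, and application name are arbitrary choices for illustration): the streaming source produces an ordinary-looking DataFrame, the transformation is plain Dataset/Spark SQL code, and only the read and write calls differ from a batch job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .getOrCreate()
    import spark.implicits._

    // A streaming DataFrame with a single "value" column, one row per line
    // received on the socket; host and port are arbitrary for illustration.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same Dataset/DataFrame operations used for batch Spark SQL apply.
    val wordCounts = lines
      .select(explode(split($"value", " ")).as("word"))
      .groupBy("word")
      .count()

    // Continuously print the updated counts to the console; awaitTermination
    // blocks until the query is stopped or fails.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Because the streaming query runs through the same Spark SQL engine, much of the batch Dataset tuning discussed earlier largely applies here as well.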
