High Performance Spark

by Holden Karau, Rachel Warren

May 2017

Intermediate to advanced

358 pages

10h 4m

English

O'Reilly Media, Inc.

Preface
First Edition Notes; Supporting Books and Materials; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact the Authors; How to Contact Us; Acknowledgments
1. Introduction to High Performance Spark
What Is Spark and Why Performance Matters; What You Can Expect to Get from This Book; Spark Versions; Why Scala?; To Be a Spark Expert You Have to Learn a Little Scala Anyway; The Spark Scala API Is Easier to Use Than the Java API; Scala Is More Performant Than Python; Why Not Scala?; Learning Scala; Conclusion
2. How Spark Works
How Spark Fits into the Big Data Ecosystem; Spark Components; Spark Model of Parallel Computing: RDDs; Lazy Evaluation; In-Memory Persistence and Memory Management; Immutability and the RDD Interface; Types of RDDs; Functions on RDDs: Transformations Versus Actions; Wide Versus Narrow Dependencies; Spark Job Scheduling; Resource Allocation Across Applications; The Spark Application; The Anatomy of a Spark Job; The DAG; Jobs; Stages; Tasks; Conclusion
3. DataFrames, Datasets, and Spark SQL
Getting Started with the SparkSession (or HiveContext or SQLContext); Spark SQL Dependencies; Managing Spark Dependencies; Avoiding Hive JARs; Basics of Schemas; DataFrame API; Transformations; Multi-DataFrame Transformations; Plain Old SQL Queries and Interacting with Hive Data; Data Representation in DataFrames and Datasets; Tungsten; Data Loading and Saving Functions; DataFrameWriter and DataFrameReader; Formats; Save Modes; Partitions (Discovery and Writing); Datasets; Interoperability with RDDs, DataFrames, and Local Collections; Compile-Time Strong Typing; Easier Functional (RDD “like”) Transformations; Relational Transformations; Multi-Dataset Relational Transformations; Grouped Operations on Datasets; Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs); Query Optimizer; Logical and Physical Plans; Code Generation; Large Query Plans and Iterative Algorithms; Debugging Spark SQL Queries; JDBC/ODBC Server; Conclusion
4. Joins (SQL and Core)
Core Spark Joins; Choosing a Join Type; Choosing an Execution Plan; Spark SQL Joins; DataFrame Joins; Dataset Joins; Conclusion
5. Effective Transformations
Narrow Versus Wide Transformations; Implications for Performance; Implications for Fault Tolerance; The Special Case of coalesce; What Type of RDD Does Your Transformation Return?; Minimizing Object Creation; Reusing Existing Objects; Using Smaller Data Structures; Iterator-to-Iterator Transformations with mapPartitions; What Is an Iterator-to-Iterator Transformation?; Space and Time Advantages; An Example; Set Operations; Reducing Setup Overhead; Shared Variables; Broadcast Variables; Accumulators; Reusing RDDs; Cases for Reuse; Deciding if Recompute Is Inexpensive Enough; Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files; Alluxio (nee Tachyon); LRU Caching; Noisy Cluster Considerations; Interaction with Accumulators; Conclusion
6. Working with Key/Value Data
The Goldilocks Example; Goldilocks Version 0: Iterative Solution; How to Use PairRDDFunctions and OrderedRDDFunctions; Actions on Key/Value Pairs; What’s So Dangerous About the groupByKey Function; Goldilocks Version 1: groupByKey Solution; Choosing an Aggregation Operation; Dictionary of Aggregation Operations with Performance Considerations; Multiple RDD Operations; Co-Grouping; Partitioners and Key/Value Data; Using the Spark Partitioner Object; Hash Partitioning; Range Partitioning; Custom Partitioning; Preserving Partitioning Information Across Transformations; Leveraging Co-Located and Co-Partitioned RDDs; Dictionary of Mapping and Partitioning Functions PairRDDFunctions; Dictionary of OrderedRDDOperations; Sorting by Two Keys with SortByKey; Secondary Sort and repartitionAndSortWithinPartitions; Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function; How Not to Sort by Two Orderings; Goldilocks Version 2: Secondary Sort; A Different Approach to Goldilocks; Goldilocks Version 3: Sort on Cell Values; Straggler Detection and Unbalanced Data; Back to Goldilocks (Again); Goldilocks Version 4: Reduce to Distinct on Each Partition; Conclusion
7. Going Beyond Scala
Beyond Scala within the JVM; Beyond Scala, and Beyond the JVM; How PySpark Works; How SparkR Works; Spark.jl (Julia Spark); How Eclair JS Works; Spark on the Common Language Runtime (CLR): C# and Friends; Calling Other Languages from Spark; Using Pipe and Friends; JNI; Java Native Access (JNA); Underneath Everything Is FORTRAN; Getting to the GPU; The Future; Conclusion
8. Testing and Validation
Unit Testing; General Spark Unit Testing; Mocking RDDs; Getting Test Data; Generating Large Datasets; Sampling; Property Checking with ScalaCheck; Computing RDD Difference; Integration Testing; Choosing Your Integration Testing Environment; Verifying Performance; Spark Counters for Verifying Performance; Projects for Verifying Performance; Job Validation; Conclusion
9. Spark MLlib and ML
Choosing Between Spark MLlib and Spark ML; Working with MLlib; Getting Started with MLlib (Organization and Imports); MLlib Feature Encoding and Data Preparation; Feature Scaling and Selection; MLlib Model Training; Predicting; Serving and Persistence; Model Evaluation; Working with Spark ML; Spark ML Organization and Imports; Pipeline Stages; Explain Params; Data Encoding; Data Cleaning; Spark ML Models; Putting It All Together in a Pipeline; Training a Pipeline; Accessing Individual Stages; Data Persistence and Spark ML; Extending Spark ML Pipelines with Your Own Algorithms; Model and Pipeline Persistence and Serving with Spark ML; General Serving Considerations; Conclusion

10. Spark Components and Packages
Stream Processing with Spark; Sources and Sinks; Batch Intervals; Data Checkpoint Intervals; Considerations for DStreams; Considerations for Structured Streaming; High Availability Mode (or Handling Driver Failure or Checkpointing); GraphX; Using Community Packages and Libraries; Creating a Spark Package; Conclusion
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
Spark Tuning and Cluster Sizing; How to Adjust Spark Settings; How to Determine the Relevant Information About Your Cluster; Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?; Calculating Executor and Driver Memory Overhead; How Large to Make the Spark Driver; A Few Large Executors or Many Small Executors?; Allocating Cluster Resources and Dynamic Allocation; Dividing the Space Within One Executor; Number and Size of Partitions; Serialization Options; Kryo; Some Additional Debugging Techniques
Index

Content preview from High Performance Spark

Chapter 9. Spark MLlib and ML

Spark has two machine learning libraries, Spark MLlib and Spark ML, with very different APIs but similar algorithms. Both inherit many of the performance considerations of the RDD and Dataset APIs they are built on, and each adds considerations of its own. MLlib is the older of the two and is entering a maintenance (bug-fix only) mode. Normally we would skip Spark MLlib and focus on the new API; however, not all of the functionality of the existing algorithms has been ported to the Spark ML API. Spark ML is the newer, scikit-learn inspired machine learning library and is where active development is taking place.

Choosing Between Spark MLlib and Spark ML

At first glance, the most obvious difference between MLlib and ML is the data types they work on: MLlib supports RDDs, while ML supports DataFrames and Datasets. This difference in data format isn’t all that important, since both libraries operate on collections of vectors, which are easily represented in, and converted between, the RDD and Dataset formats.

From a design philosophy point of view, Spark’s MLlib is focused on providing a core set of algorithms for people to use, while largely leaving the data pipeline, cleaning, preparation, and feature selection problems up to the user. Spark ML instead focuses on exposing a scikit-learn inspired pipeline API for everything from data preparation to model training.
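The pipeline idea can be sketched without Spark at all. The following is a toy, dependency-free Python illustration of the general pattern that scikit-learn style pipeline APIs follow: data preparation steps expose `transform`, a learning step exposes `fit` and returns a fitted model, and a pipeline chains them. Every class name here (`Lowercase`, `WordCount`, `ThresholdEstimator`, and so on) is hypothetical and chosen for illustration; this is not Spark ML's API, and real Spark ML stages operate on DataFrames rather than Python lists.

```python
# Toy illustration of the estimator/transformer pipeline pattern.
# All classes below are hypothetical; they are NOT Spark ML's classes.
from statistics import mean


class Lowercase:
    """Transformer: a stateless data-preparation step."""
    def transform(self, rows):
        return [(text.lower(), label) for text, label in rows]


class WordCount:
    """Transformer: turns each text into a single numeric feature."""
    def transform(self, rows):
        return [(len(text.split()), label) for text, label in rows]


class ThresholdModel:
    """Fitted model: a transformer that appends a 0/1 prediction."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [(x, label, 1 if x > self.threshold else 0) for x, label in rows]


class ThresholdEstimator:
    """Estimator: learns a threshold halfway between the two class means."""
    def fit(self, rows):
        pos = mean(x for x, label in rows if label == 1)
        neg = mean(x for x, label in rows if label == 0)
        return ThresholdModel((pos + neg) / 2)


class PipelineModel:
    """A fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains preparation and training behind a single fit() call."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimators are fitted first
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # feed the output to the next stage
        return PipelineModel(fitted)


training = [
    ("Spark ML has a pipeline API", 1),
    ("Short text", 0),
    ("MLlib works on RDDs of vectors", 1),
    ("Hi", 0),
]
model = Pipeline([Lowercase(), WordCount(), ThresholdEstimator()]).fit(training)
predictions = model.transform([("Two words", None)])
```

The point of the design is that the same preparation steps run at training time and at prediction time, because they travel with the fitted model instead of being re-implemented by the user. That is the work MLlib, as described above, largely leaves up to the user.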

Currently, if you need to do streaming ...

Publisher Resources

ISBN: 9781491943199
