Learning Spark, 2nd Edition

by Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee

July 2020

Intermediate to advanced

397 pages

9h 49m

English

O'Reilly Media, Inc.

Foreword
Preface
Who This Book Is For; How the Book Is Organized; How to Use the Code Examples; Software and Configuration Used; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction to Apache Spark: A Unified Analytics Engine
The Genesis of Spark; Big Data and Distributed Computing at Google; Hadoop at Yahoo!; Spark’s Early Years at AMPLab; What Is Apache Spark?; Speed; Ease of Use; Modularity; Extensibility; Unified Analytics; Apache Spark Components as a Unified Stack; Apache Spark’s Distributed Execution; The Developer’s Experience; Who Uses Spark, and for What?; Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
Step 1: Downloading Apache Spark; Spark’s Directories and Files; Step 2: Using the Scala or PySpark Shell; Using the Local Machine; Step 3: Understanding Spark Application Concepts; Spark Application and SparkSession; Spark Jobs; Spark Stages; Spark Tasks; Transformations, Actions, and Lazy Evaluation; Narrow and Wide Transformations; The Spark UI; Your First Standalone Application; Counting M&Ms for the Cookie Monster; Building Standalone Applications in Scala; Summary
3. Apache Spark’s Structured APIs
Spark: What’s Underneath an RDD?; Structuring Spark; Key Merits and Benefits; The DataFrame API; Spark’s Basic Data Types; Spark’s Structured and Complex Data Types; Schemas and Creating DataFrames; Columns and Expressions; Rows; Common DataFrame Operations; End-to-End DataFrame Example; The Dataset API; Typed Objects, Untyped Objects, and Generic Rows; Creating Datasets; Dataset Operations; End-to-End Dataset Example; DataFrames Versus Datasets; When to Use RDDs; Spark SQL and the Underlying Engine; The Catalyst Optimizer; Summary
4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications; Basic Query Examples; SQL Tables and Views; Managed Versus Unmanaged Tables; Creating SQL Databases and Tables; Creating Views; Viewing the Metadata; Caching SQL Tables; Reading Tables into DataFrames; Data Sources for DataFrames and SQL Tables; DataFrameReader; DataFrameWriter; Parquet; JSON; CSV; Avro; ORC; Images; Binary Files; Summary
5. Spark SQL and DataFrames: Interacting with External Data Sources
Spark SQL and Apache Hive; User-Defined Functions; Querying with the Spark SQL Shell, Beeline, and Tableau; Using the Spark SQL Shell; Working with Beeline; Working with Tableau; External Data Sources; JDBC and SQL Databases; PostgreSQL; MySQL; Azure Cosmos DB; MS SQL Server; Other External Sources; Higher-Order Functions in DataFrames and Spark SQL; Option 1: Explode and Collect; Option 2: User-Defined Function; Built-in Functions for Complex Data Types; Higher-Order Functions; Common DataFrames and Spark SQL Operations; Unions; Joins; Windowing; Modifications; Summary
6. Spark SQL and Datasets
Single API for Java and Scala; Scala Case Classes and JavaBeans for Datasets; Working with Datasets; Creating Sample Data; Transforming Sample Data; Memory Management for Datasets and DataFrames; Dataset Encoders; Spark’s Internal Format Versus Java Object Format; Serialization and Deserialization (SerDe); Costs of Using Datasets; Strategies to Mitigate Costs; Summary
7. Optimizing and Tuning Spark Applications
Optimizing and Tuning Spark for Efficiency; Viewing and Setting Apache Spark Configurations; Scaling Spark for Large Workloads; Caching and Persistence of Data; DataFrame.cache(); DataFrame.persist(); When to Cache and Persist; When Not to Cache and Persist; A Family of Spark Joins; Broadcast Hash Join; Shuffle Sort Merge Join; Inspecting the Spark UI; Journey Through the Spark UI Tabs; Summary
8. Structured Streaming
Evolution of the Apache Spark Stream Processing Engine; The Advent of Micro-Batch Stream Processing; Lessons Learned from Spark Streaming (DStreams); The Philosophy of Structured Streaming; The Programming Model of Structured Streaming; The Fundamentals of a Structured Streaming Query; Five Steps to Define a Streaming Query; Under the Hood of an Active Streaming Query; Recovering from Failures with Exactly-Once Guarantees; Monitoring an Active Query; Streaming Data Sources and Sinks; Files; Apache Kafka; Custom Streaming Sources and Sinks; Data Transformations; Incremental Execution and Streaming State; Stateless Transformations; Stateful Transformations; Stateful Streaming Aggregations; Aggregations Not Based on Time; Aggregations with Event-Time Windows; Streaming Joins; Stream–Static Joins; Stream–Stream Joins; Arbitrary Stateful Computations; Modeling Arbitrary Stateful Operations with mapGroupsWithState(); Using Timeouts to Manage Inactive Groups; Generalization with flatMapGroupsWithState(); Performance Tuning; Summary

9. Building Reliable Data Lakes with Apache Spark
The Importance of an Optimal Storage Solution; Databases; A Brief Introduction to Databases; Reading from and Writing to Databases Using Apache Spark; Limitations of Databases; Data Lakes; A Brief Introduction to Data Lakes; Reading from and Writing to Data Lakes using Apache Spark; Limitations of Data Lakes; Lakehouses: The Next Step in the Evolution of Storage Solutions; Apache Hudi; Apache Iceberg; Delta Lake; Building Lakehouses with Apache Spark and Delta Lake; Configuring Apache Spark with Delta Lake; Loading Data into a Delta Lake Table; Loading Data Streams into a Delta Lake Table; Enforcing Schema on Write to Prevent Data Corruption; Evolving Schemas to Accommodate Changing Data; Transforming Existing Data; Auditing Data Changes with Operation History; Querying Previous Snapshots of a Table with Time Travel; Summary
10. Machine Learning with MLlib
What Is Machine Learning?; Supervised Learning; Unsupervised Learning; Why Spark for Machine Learning?; Designing Machine Learning Pipelines; Data Ingestion and Exploration; Creating Training and Test Data Sets; Preparing Features with Transformers; Understanding Linear Regression; Using Estimators to Build Models; Creating a Pipeline; Evaluating Models; Saving and Loading Models; Hyperparameter Tuning; Tree-Based Models; k-Fold Cross-Validation; Optimizing Pipelines; Summary
11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Model Management; MLflow; Model Deployment Options with MLlib; Batch; Streaming; Model Export Patterns for Real-Time Inference; Leveraging Spark for Non-MLlib Models; Pandas UDFs; Spark for Distributed Hyperparameter Tuning; Summary
12. Epilogue: Apache Spark 3.0
Spark Core and Spark SQL; Dynamic Partition Pruning; Adaptive Query Execution; SQL Join Hints; Catalog Plugin API and DataSourceV2; Accelerator-Aware Scheduler; Structured Streaming; PySpark, Pandas UDFs, and Pandas Function APIs; Redesigned Pandas UDFs with Python Type Hints; Iterator Support in Pandas UDFs; New Pandas Function APIs; Changed Functionality; Languages Supported and Deprecated; Changes to the DataFrame and Dataset APIs; DataFrame and SQL Explain Commands; Summary
Index
About the Authors

Content preview from Learning Spark, 2nd Edition

Chapter 12. Epilogue: Apache Spark 3.0

At the time we were writing this book, Apache Spark 3.0 had not yet been officially released; it was still under development, and we worked with Spark 3.0.0-preview2. All the code samples in this book have been tested against Spark 3.0.0-preview2, and they should work no differently with the official Spark 3.0 release. Where relevant, the earlier chapters note which features or behaviors are new in Spark 3.0. In this chapter, we survey the changes.

The bug fixes and feature enhancements are numerous, so for brevity, we highlight just a selection of the notable changes and features pertaining to Spark components. Some of the new features are advanced, under-the-hood changes beyond the scope of this book, but we mention them here so you can explore them when the release is generally available.

Spark Core and Spark SQL

Let’s first consider what’s new under the covers. A number of changes have been introduced in Spark Core and the Spark SQL engine to help speed up queries. One way to expedite queries is to read less data using dynamic partition pruning. Another is to adapt and optimize query plans during execution.

Dynamic Partition Pruning

The idea behind dynamic partition pruning (DPP) is to skip over the data you don’t need in a query’s results. The typical scenario where DPP is optimal is when you are joining two tables: a fact table (partitioned over multiple columns) and a dimension table (nonpartitioned), ...
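The pruning idea described above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not how Spark implements DPP: the tables `sales_by_region` and `regions` and the helper `join_with_pruning` are hypothetical names invented for this sketch. The fact table is modeled as one list of rows per partition key, the filter is applied to the small dimension table first, and only the fact partitions whose key survives that filter are scanned.

```python
# Hypothetical fact table, stored as one list of rows per partition key (region).
sales_by_region = {
    "EU": [("EU", "widget", 3), ("EU", "gadget", 5)],
    "US": [("US", "widget", 7)],
    "APAC": [("APAC", "gadget", 2), ("APAC", "widget", 4)],
}

# Hypothetical small, nonpartitioned dimension table.
regions = [
    {"region": "EU", "currency": "EUR"},
    {"region": "US", "currency": "USD"},
    {"region": "APAC", "currency": "JPY"},
]


def join_with_pruning(fact_partitions, dimension_rows, dim_predicate):
    """Join fact and dimension rows, scanning only the partitions that can match."""
    # Step 1: filter the dimension table and index the surviving rows by join key.
    matching = {d["region"]: d for d in dimension_rows if dim_predicate(d)}

    joined, scanned = [], []
    for key, rows in fact_partitions.items():
        # Step 2: skip every fact partition whose key was filtered out.
        if key not in matching:
            continue
        scanned.append(key)
        for region, product, quantity in rows:
            joined.append((region, product, quantity, matching[region]["currency"]))
    return joined, scanned


joined_rows, scanned_partitions = join_with_pruning(
    sales_by_region, regions, lambda d: d["currency"] == "EUR"
)
print(scanned_partitions)  # only the "EU" partition is read
print(joined_rows)
```

In this toy run the filter on the dimension table leaves a single join key, so two of the three fact partitions are never read. The saving grows with the number of partitions the filter rules out.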

Publisher Resources

ISBN: 9781492050032
