Chapter 11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark

In the previous chapter, we covered how to build machine learning pipelines with MLlib. This chapter will focus on how to manage and deploy the models you train. By the end of this chapter, you will be able to use MLflow to track, reproduce, and deploy your MLlib models, discuss the difficulties of and trade-offs among various model deployment scenarios, and architect scalable machine learning solutions. But before we discuss deploying models, let’s first discuss some best practices for model management to get your models ready for deployment.

Model Management

Before you deploy your machine learning model, you should ensure that you can reproduce and track the model’s performance. For us, end-to-end reproducibility of machine learning solutions means that we need to be able to reproduce the code that generated a model, the environment used in training, the data it was trained on, and the model itself. Every data scientist loves to remind you to set your seeds so you can reproduce your experiments (e.g., for the train/test split and for models with inherent randomness, such as random forests); a short sketch of where these seeds appear follows the examples below. However, there are many more aspects that contribute to reproducibility than just setting seeds, and some of them are much more subtle. Here are a few examples:

Library versioning
When a data scientist hands you their code, they may or may not mention the dependent libraries. While you are able to work out which libraries are required through trial and error, you won't know which versions they were developed against, so it is a best practice to record and pin exact library versions as part of the training environment.
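
One lightweight way to do this is sketched below, under assumptions: the recorded libraries and the output path are hypothetical placeholders, and this is an illustration rather than the book's code. MLflow, covered later in this chapter, captures much of this environment information for you when you log a model.

    # Minimal sketch (an assumption, not the book's code): write the exact
    # library versions used for training to a file stored next to the model.
    # The output path is a hypothetical placeholder.
    import json
    import sys
    import pyspark

    versions = {
        "python": sys.version.split()[0],
        "pyspark": pyspark.__version__,
    }

    with open("/models/my_model/library_versions.json", "w") as f:
        json.dump(versions, f, indent=2)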

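To make the earlier point about seeds concrete, here is a minimal sketch of where seeds typically appear in an MLlib workflow. The data path and column names are hypothetical placeholders, and this is an illustration rather than the book's code.

    # Minimal sketch: seeding the train/test split and a random forest so
    # repeated runs produce the same results.
    # The data path and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/training_data.parquet")

    # Seed the split so the same rows land in the train and test sets every run.
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

    # Assemble feature columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=["feature1", "feature2"],
                                outputCol="features")

    # Random forests sample rows and features at random, so they accept a seed too.
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", seed=42)
    model = rf.fit(assembler.transform(train_df))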