Learning Spark, 2nd Edition

by Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee

July 2020

Intermediate to advanced

397 pages

9h 49m

English

O'Reilly Media, Inc.

Foreword
Preface
Who This Book Is For; How the Book Is Organized; How to Use the Code Examples; Software and Configuration Used; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction to Apache Spark: A Unified Analytics Engine
The Genesis of Spark; Big Data and Distributed Computing at Google; Hadoop at Yahoo!; Spark’s Early Years at AMPLab; What Is Apache Spark?; Speed; Ease of Use; Modularity; Extensibility; Unified Analytics; Apache Spark Components as a Unified Stack; Apache Spark’s Distributed Execution; The Developer’s Experience; Who Uses Spark, and for What?; Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
Step 1: Downloading Apache Spark; Spark’s Directories and Files; Step 2: Using the Scala or PySpark Shell; Using the Local Machine; Step 3: Understanding Spark Application Concepts; Spark Application and SparkSession; Spark Jobs; Spark Stages; Spark Tasks; Transformations, Actions, and Lazy Evaluation; Narrow and Wide Transformations; The Spark UI; Your First Standalone Application; Counting M&Ms for the Cookie Monster; Building Standalone Applications in Scala; Summary
3. Apache Spark’s Structured APIs
Spark: What’s Underneath an RDD?; Structuring Spark; Key Merits and Benefits; The DataFrame API; Spark’s Basic Data Types; Spark’s Structured and Complex Data Types; Schemas and Creating DataFrames; Columns and Expressions; Rows; Common DataFrame Operations; End-to-End DataFrame Example; The Dataset API; Typed Objects, Untyped Objects, and Generic Rows; Creating Datasets; Dataset Operations; End-to-End Dataset Example; DataFrames Versus Datasets; When to Use RDDs; Spark SQL and the Underlying Engine; The Catalyst Optimizer; Summary
4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications; Basic Query Examples; SQL Tables and Views; Managed Versus Unmanaged Tables; Creating SQL Databases and Tables; Creating Views; Viewing the Metadata; Caching SQL Tables; Reading Tables into DataFrames; Data Sources for DataFrames and SQL Tables; DataFrameReader; DataFrameWriter; Parquet; JSON; CSV; Avro; ORC; Images; Binary Files; Summary
5. Spark SQL and DataFrames: Interacting with External Data Sources
Spark SQL and Apache Hive; User-Defined Functions; Querying with the Spark SQL Shell, Beeline, and Tableau; Using the Spark SQL Shell; Working with Beeline; Working with Tableau; External Data Sources; JDBC and SQL Databases; PostgreSQL; MySQL; Azure Cosmos DB; MS SQL Server; Other External Sources; Higher-Order Functions in DataFrames and Spark SQL; Option 1: Explode and Collect; Option 2: User-Defined Function; Built-in Functions for Complex Data Types; Higher-Order Functions; Common DataFrames and Spark SQL Operations; Unions; Joins; Windowing; Modifications; Summary
6. Spark SQL and Datasets
Single API for Java and Scala; Scala Case Classes and JavaBeans for Datasets; Working with Datasets; Creating Sample Data; Transforming Sample Data; Memory Management for Datasets and DataFrames; Dataset Encoders; Spark’s Internal Format Versus Java Object Format; Serialization and Deserialization (SerDe); Costs of Using Datasets; Strategies to Mitigate Costs; Summary
7. Optimizing and Tuning Spark Applications
Optimizing and Tuning Spark for Efficiency; Viewing and Setting Apache Spark Configurations; Scaling Spark for Large Workloads; Caching and Persistence of Data; DataFrame.cache(); DataFrame.persist(); When to Cache and Persist; When Not to Cache and Persist; A Family of Spark Joins; Broadcast Hash Join; Shuffle Sort Merge Join; Inspecting the Spark UI; Journey Through the Spark UI Tabs; Summary
8. Structured Streaming
Evolution of the Apache Spark Stream Processing Engine; The Advent of Micro-Batch Stream Processing; Lessons Learned from Spark Streaming (DStreams); The Philosophy of Structured Streaming; The Programming Model of Structured Streaming; The Fundamentals of a Structured Streaming Query; Five Steps to Define a Streaming Query; Under the Hood of an Active Streaming Query; Recovering from Failures with Exactly-Once Guarantees; Monitoring an Active Query; Streaming Data Sources and Sinks; Files; Apache Kafka; Custom Streaming Sources and Sinks; Data Transformations; Incremental Execution and Streaming State; Stateless Transformations; Stateful Transformations; Stateful Streaming Aggregations; Aggregations Not Based on Time; Aggregations with Event-Time Windows; Streaming Joins; Stream–Static Joins; Stream–Stream Joins; Arbitrary Stateful Computations; Modeling Arbitrary Stateful Operations with mapGroupsWithState(); Using Timeouts to Manage Inactive Groups; Generalization with flatMapGroupsWithState(); Performance Tuning; Summary

9. Building Reliable Data Lakes with Apache Spark
The Importance of an Optimal Storage Solution; Databases; A Brief Introduction to Databases; Reading from and Writing to Databases Using Apache Spark; Limitations of Databases; Data Lakes; A Brief Introduction to Data Lakes; Reading from and Writing to Data Lakes using Apache Spark; Limitations of Data Lakes; Lakehouses: The Next Step in the Evolution of Storage Solutions; Apache Hudi; Apache Iceberg; Delta Lake; Building Lakehouses with Apache Spark and Delta Lake; Configuring Apache Spark with Delta Lake; Loading Data into a Delta Lake Table; Loading Data Streams into a Delta Lake Table; Enforcing Schema on Write to Prevent Data Corruption; Evolving Schemas to Accommodate Changing Data; Transforming Existing Data; Auditing Data Changes with Operation History; Querying Previous Snapshots of a Table with Time Travel; Summary
10. Machine Learning with MLlib
What Is Machine Learning?; Supervised Learning; Unsupervised Learning; Why Spark for Machine Learning?; Designing Machine Learning Pipelines; Data Ingestion and Exploration; Creating Training and Test Data Sets; Preparing Features with Transformers; Understanding Linear Regression; Using Estimators to Build Models; Creating a Pipeline; Evaluating Models; Saving and Loading Models; Hyperparameter Tuning; Tree-Based Models; k-Fold Cross-Validation; Optimizing Pipelines; Summary
11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Model Management; MLflow; Model Deployment Options with MLlib; Batch; Streaming; Model Export Patterns for Real-Time Inference; Leveraging Spark for Non-MLlib Models; Pandas UDFs; Spark for Distributed Hyperparameter Tuning; Summary
12. Epilogue: Apache Spark 3.0
Spark Core and Spark SQL; Dynamic Partition Pruning; Adaptive Query Execution; SQL Join Hints; Catalog Plugin API and DataSourceV2; Accelerator-Aware Scheduler; Structured Streaming; PySpark, Pandas UDFs, and Pandas Function APIs; Redesigned Pandas UDFs with Python Type Hints; Iterator Support in Pandas UDFs; New Pandas Function APIs; Changed Functionality; Languages Supported and Deprecated; Changes to the DataFrame and Dataset APIs; DataFrame and SQL Explain Commands; Summary
Index
About the Authors

Content preview from Learning Spark, 2nd Edition

Foreword

Apache Spark has evolved significantly since I first started the project at UC Berkeley in 2009. After moving to the Apache Software Foundation, the open source project has had over 1,400 contributors from hundreds of companies, and the global Spark meetup group has grown to over half a million members. Spark’s user base has also become highly diverse, encompassing Python, R, SQL, and JVM developers, with use cases ranging from data science to business intelligence to data engineering. I have been working closely with the Apache Spark community to help continue its development, and I am thrilled to see the progress thus far.

The release of Spark 3.0 marks an important milestone for the project and has sparked the need for updated learning material. The idea of a second edition of Learning Spark has come up many times—and it was overdue. Even though I coauthored both Learning Spark and Spark: The Definitive Guide (both O’Reilly), it was time for me to let the next generation of Spark contributors pick up the narrative. I’m delighted that four experienced practitioners and developers, who have been working closely with Apache Spark from its early days, have teamed up to write this second edition of the book, incorporating the most recent APIs and best practices for Spark developers in a clear and informative guide.

The authors’ approach to this edition is highly conducive to hands-on learning. The key concepts in Spark and distributed big data processing have been distilled ...

Publisher Resources

ISBN: 9781492050032
