Learning Spark, 2nd Edition

Book description

Data is bigger, arrives faster, and comes in a variety of formats—and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you’ll be able to:

  • Learn the high-level Structured APIs in Python, SQL, Scala, or Java
  • Understand Spark operations and the SQL engine
  • Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow

Table of contents

  1. Foreword
  2. Preface
    1. Who This Book Is For
    2. How the Book Is Organized
    3. How to Use the Code Examples
    4. Software and Configuration Used
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
  3. 1. Introduction to Apache Spark: A Unified Analytics Engine
    1. The Genesis of Spark
      1. Big Data and Distributed Computing at Google
      2. Hadoop at Yahoo!
      3. Spark’s Early Years at AMPLab
    2. What Is Apache Spark?
      1. Speed
      2. Ease of Use
      3. Modularity
      4. Extensibility
    3. Unified Analytics
      1. Apache Spark Components as a Unified Stack
      2. Apache Spark’s Distributed Execution
    4. The Developer’s Experience
      1. Who Uses Spark, and for What?
      2. Community Adoption and Expansion
  4. 2. Downloading Apache Spark and Getting Started
    1. Step 1: Downloading Apache Spark
      1. Spark’s Directories and Files
    2. Step 2: Using the Scala or PySpark Shell
      1. Using the Local Machine
    3. Step 3: Understanding Spark Application Concepts
      1. Spark Application and SparkSession
      2. Spark Jobs
      3. Spark Stages
      4. Spark Tasks
    4. Transformations, Actions, and Lazy Evaluation
      1. Narrow and Wide Transformations
    5. The Spark UI
    6. Your First Standalone Application
      1. Counting M&Ms for the Cookie Monster
      2. Building Standalone Applications in Scala
    7. Summary
  5. 3. Apache Spark’s Structured APIs
    1. Spark: What’s Underneath an RDD?
    2. Structuring Spark
      1. Key Merits and Benefits
    3. The DataFrame API
      1. Spark’s Basic Data Types
      2. Spark’s Structured and Complex Data Types
      3. Schemas and Creating DataFrames
      4. Columns and Expressions
      5. Rows
      6. Common DataFrame Operations
      7. End-to-End DataFrame Example
    4. The Dataset API
      1. Typed Objects, Untyped Objects, and Generic Rows
      2. Creating Datasets
      3. Dataset Operations
      4. End-to-End Dataset Example
    5. DataFrames Versus Datasets
      1. When to Use RDDs
    6. Spark SQL and the Underlying Engine
      1. The Catalyst Optimizer
    7. Summary
  6. 4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
    1. Using Spark SQL in Spark Applications
      1. Basic Query Examples
    2. SQL Tables and Views
      1. Managed Versus Unmanaged Tables
      2. Creating SQL Databases and Tables
      3. Creating Views
      4. Viewing the Metadata
      5. Caching SQL Tables
      6. Reading Tables into DataFrames
    3. Data Sources for DataFrames and SQL Tables
      1. DataFrameReader
      2. DataFrameWriter
      3. Parquet
      4. JSON
      5. CSV
      6. Avro
      7. ORC
      8. Images
      9. Binary Files
    4. Summary
  7. 5. Spark SQL and DataFrames: Interacting with External Data Sources
    1. Spark SQL and Apache Hive
      1. User-Defined Functions
    2. Querying with the Spark SQL Shell, Beeline, and Tableau
      1. Using the Spark SQL Shell
      2. Working with Beeline
      3. Working with Tableau
    3. External Data Sources
      1. JDBC and SQL Databases
      2. PostgreSQL
      3. MySQL
      4. Azure Cosmos DB
      5. MS SQL Server
      6. Other External Sources
    4. Higher-Order Functions in DataFrames and Spark SQL
      1. Option 1: Explode and Collect
      2. Option 2: User-Defined Function
      3. Built-in Functions for Complex Data Types
      4. Higher-Order Functions
    5. Common DataFrames and Spark SQL Operations
      1. Unions
      2. Joins
      3. Windowing
      4. Modifications
    6. Summary
  8. 6. Spark SQL and Datasets
    1. Single API for Java and Scala
      1. Scala Case Classes and JavaBeans for Datasets
    2. Working with Datasets
      1. Creating Sample Data
      2. Transforming Sample Data
    3. Memory Management for Datasets and DataFrames
    4. Dataset Encoders
      1. Spark’s Internal Format Versus Java Object Format
      2. Serialization and Deserialization (SerDe)
    5. Costs of Using Datasets
      1. Strategies to Mitigate Costs
    6. Summary
  9. 7. Optimizing and Tuning Spark Applications
    1. Optimizing and Tuning Spark for Efficiency
      1. Viewing and Setting Apache Spark Configurations
      2. Scaling Spark for Large Workloads
    2. Caching and Persistence of Data
      1. DataFrame.cache()
      2. DataFrame.persist()
      3. When to Cache and Persist
      4. When Not to Cache and Persist
    3. A Family of Spark Joins
      1. Broadcast Hash Join
      2. Shuffle Sort Merge Join
    4. Inspecting the Spark UI
      1. Journey Through the Spark UI Tabs
    5. Summary
  10. 8. Structured Streaming
    1. Evolution of the Apache Spark Stream Processing Engine
      1. The Advent of Micro-Batch Stream Processing
      2. Lessons Learned from Spark Streaming (DStreams)
      3. The Philosophy of Structured Streaming
    2. The Programming Model of Structured Streaming
    3. The Fundamentals of a Structured Streaming Query
      1. Five Steps to Define a Streaming Query
      2. Under the Hood of an Active Streaming Query
      3. Recovering from Failures with Exactly-Once Guarantees
      4. Monitoring an Active Query
    4. Streaming Data Sources and Sinks
      1. Files
      2. Apache Kafka
      3. Custom Streaming Sources and Sinks
    5. Data Transformations
      1. Incremental Execution and Streaming State
      2. Stateless Transformations
      3. Stateful Transformations
    6. Stateful Streaming Aggregations
      1. Aggregations Not Based on Time
      2. Aggregations with Event-Time Windows
    7. Streaming Joins
      1. Stream–Static Joins
      2. Stream–Stream Joins
    8. Arbitrary Stateful Computations
      1. Modeling Arbitrary Stateful Operations with mapGroupsWithState()
      2. Using Timeouts to Manage Inactive Groups
      3. Generalization with flatMapGroupsWithState()
    9. Performance Tuning
    10. Summary
  11. 9. Building Reliable Data Lakes with Apache Spark
    1. The Importance of an Optimal Storage Solution
    2. Databases
      1. A Brief Introduction to Databases
      2. Reading from and Writing to Databases Using Apache Spark
      3. Limitations of Databases
    3. Data Lakes
      1. A Brief Introduction to Data Lakes
      2. Reading from and Writing to Data Lakes Using Apache Spark
      3. Limitations of Data Lakes
    4. Lakehouses: The Next Step in the Evolution of Storage Solutions
      1. Apache Hudi
      2. Apache Iceberg
      3. Delta Lake
    5. Building Lakehouses with Apache Spark and Delta Lake
      1. Configuring Apache Spark with Delta Lake
      2. Loading Data into a Delta Lake Table
      3. Loading Data Streams into a Delta Lake Table
      4. Enforcing Schema on Write to Prevent Data Corruption
      5. Evolving Schemas to Accommodate Changing Data
      6. Transforming Existing Data
      7. Auditing Data Changes with Operation History
      8. Querying Previous Snapshots of a Table with Time Travel
    6. Summary
  12. 10. Machine Learning with MLlib
    1. What Is Machine Learning?
      1. Supervised Learning
      2. Unsupervised Learning
      3. Why Spark for Machine Learning?
    2. Designing Machine Learning Pipelines
      1. Data Ingestion and Exploration
      2. Creating Training and Test Data Sets
      3. Preparing Features with Transformers
      4. Understanding Linear Regression
      5. Using Estimators to Build Models
      6. Creating a Pipeline
      7. Evaluating Models
      8. Saving and Loading Models
    3. Hyperparameter Tuning
      1. Tree-Based Models
      2. k-Fold Cross-Validation
      3. Optimizing Pipelines
    4. Summary
  13. 11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
    1. Model Management
      1. MLflow
    2. Model Deployment Options with MLlib
      1. Batch
      2. Streaming
      3. Model Export Patterns for Real-Time Inference
    3. Leveraging Spark for Non-MLlib Models
      1. Pandas UDFs
      2. Spark for Distributed Hyperparameter Tuning
    4. Summary
  14. 12. Epilogue: Apache Spark 3.0
    1. Spark Core and Spark SQL
      1. Dynamic Partition Pruning
      2. Adaptive Query Execution
      3. SQL Join Hints
      4. Catalog Plugin API and DataSourceV2
      5. Accelerator-Aware Scheduler
    2. Structured Streaming
    3. PySpark, Pandas UDFs, and Pandas Function APIs
      1. Redesigned Pandas UDFs with Python Type Hints
      2. Iterator Support in Pandas UDFs
      3. New Pandas Function APIs
    4. Changed Functionality
      1. Languages Supported and Deprecated
      2. Changes to the DataFrame and Dataset APIs
      3. DataFrame and SQL Explain Commands
    5. Summary
  15. Index

Product information

  • Title: Learning Spark, 2nd Edition
  • Author(s): Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
  • Release date: July 2020
  • Publisher(s): O’Reilly Media, Inc.
  • ISBN: 9781492050049