
Learning Spark, 2nd Edition

Book Description

Data is getting bigger, arriving faster, and coming in varied formats—and it all needs to be processed at scale for analytics or machine learning. How can you process such varied data workloads efficiently? Enter Apache Spark.

Updated to emphasize new features in Spark 2.x, this second edition shows data engineers and scientists why structure and unification matter in Spark. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through discourse, code snippets, and notebooks, you’ll be able to:

  • Learn the high-level structured APIs in Python, SQL, Scala, or Java: DataFrames and Datasets
  • Peek under the hood of the Spark SQL engine to understand Spark transformations and performance
  • Inspect, tune, and debug your Spark operations using Spark configurations and the Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow
  • Use Koalas, the open source pandas-compatible framework, with Spark for data transformation and feature engineering


Table of Contents

  1. Introduction to Unified Analytics with Apache Spark
    1. The Genesis of Big Data and Distributed Computing at Google
    2. Hadoop at Yahoo!
      1. Spark’s Early Years at AMPLab
    3. What Is Apache Spark?
      1. Speed
      2. Ease of Use
      3. Modularity
      4. Extensibility
    4. Why Unified Analytics?
      1. Apache Spark Components as a Unified Stack
      2. Apache Spark’s Distributed Execution and Concepts
    5. Developer’s Experience
    6. Who Uses Spark, and for What?
      1. Data Science Tasks
      2. Data Engineering Tasks
      3. Machine Learning or Deep Learning Tasks
      4. Community Adoption and Expansion
  2. Downloading Apache Spark and Getting Started
    1. Step 1: Download Apache Spark
      1. Spark’s Directories and Files
    2. Step 2: Use Scala Shell or PySpark Shell
      1. Using Local Machine
    3. Step 3: Understand Spark Application Concepts
      1. Spark Application and SparkSession
      2. Spark Jobs
      3. Spark Stages
      4. Spark Tasks
      5. Transformations, Actions, and Lazy Evaluation
    4. Spark UI
      1. Databricks Community Edition
    5. First Standalone Application
      1. Using Local Machine
      2. Counting M&Ms for the Cookie Monster
      3. Building Standalone Applications in Scala
    6. Summary
  3. Apache Spark’s Structured APIs
    1. A Bit of History…
    2. Unstructured Spark: What’s Underneath an RDD?
    3. Structuring Spark
      1. Key Merits and Benefits
    4. Structured APIs: DataFrames and Datasets APIs
      1. DataFrames API
      2. Common DataFrame Operations
      3. Datasets API
      4. DataFrames vs Datasets
      5. What about RDDs?
    5. Spark SQL and the Underlying Engine
      1. Catalyst Optimizer
    6. Summary
  4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
    1. Using Spark SQL in Spark Applications
      1. Basic Query Example
    2. SQL Tables and Views
    3. Data Sources for DataFrames and SQL Tables
      1. DataFrameReader
      2. DataFrameWriter
      3. Parquet
      4. JSON
      5. CSV
      6. Avro
      7. ORC
      8. Image
    4. Summary
  5. Spark SQL and Datasets
    1. Single API for Java and Scala
      1. Scala Case Classes and JavaBeans for Datasets
    2. Working with Datasets
      1. Creating Sample Data
      2. Transforming Sample Data
    3. Memory Management for Datasets and DataFrames
    4. Dataset Encoders
      1. Spark’s Internal Format vs Java Object Format
      2. Serialization and Deserialization (SerDe)
    5. Costs of Using Datasets
      1. Strategies to Mitigate Costs
    6. Summary
  6. Loading and Saving Your Data
    1. Motivation for Data Sources
    2. File Formats: Revisited
      1. Text Files
    3. Organizing Data for Efficient I/O
      1. Partitioning
      2. Bucketing
      3. Compression Schemes
    4. Saving as Parquet Files
      1. Delta Lake Storage Format
      2. Delta Lake Table
    5. Summary