O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Stream Processing with Apache Spark

Book Description

Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs.

Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API.

  • Learn fundamental stream processing concepts and examine different streaming architectures
  • Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail
  • Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs
  • Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms
  • Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams

Table of Contents

  1. Foreword
  2. Preface
    1. Who Should Read This Book?
    2. Installing Spark
    3. Learning Scala
    4. The Way Ahead
    5. Bibliography
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
      1. From Gerard
      2. From François
  3. I. Fundamentals of Stream Processing with Apache Spark
  4. 1. Introducing Stream Processing
    1. What Is Stream Processing?
      1. Batch Versus Stream Processing
      2. The Notion of Time in Stream Processing
      3. The Factor of Uncertainty
    2. Some Examples of Stream Processing
    3. Scaling Up Data Processing
      1. MapReduce
      2. The Lesson Learned: Scalability and Fault Tolerance
    4. Distributed Stream Processing
      1. Stateful Stream Processing in a Distributed System
    5. Introducing Apache Spark
      1. The First Wave: Functional APIs
      2. The Second Wave: SQL
      3. A Unified Engine
      4. Spark Components
      5. Spark Streaming
      6. Structured Streaming
    6. Where Next?
  5. 2. Stream-Processing Model
    1. Sources and Sinks
    2. Immutable Streams Defined from One Another
    3. Transformations and Aggregations
    4. Window Aggregations
      1. Tumbling Windows
      2. Sliding Windows
    5. Stateless and Stateful Processing
    6. Stateful Streams
    7. An Example: Local Stateful Computation in Scala
      1. A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
    8. Stateless or Stateful Streaming
    9. The Effect of Time
      1. Computing on Timestamped Events
      2. Timestamps as the Provider of the Notion of Time
      3. Event Time Versus Processing Time
      4. Computing with a Watermark
    10. Summary
  6. 3. Streaming Architectures
    1. Components of a Data Platform
    2. Architectural Models
    3. The Use of a Batch-Processing Component in a Streaming Application
    4. Referential Streaming Architectures
      1. The Lambda Architecture
      2. The Kappa Architecture
    5. Streaming Versus Batch Algorithms
      1. Streaming Algorithms Are Sometimes Completely Different in Nature
      2. Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
    6. Summary
  7. 4. Apache Spark as a Stream-Processing Engine
    1. The Tale of Two APIs
    2. Spark’s Memory Usage
      1. Failure Recovery
      2. Lazy Evaluation
      3. Cache Hints
    3. Understanding Latency
    4. Throughput-Oriented Processing
    5. Spark’s Polyglot API
    6. Fast Implementation of Data Analysis
    7. To Learn More About Spark
    8. Summary
  8. 5. Spark’s Distributed Processing Model
    1. Running Apache Spark with a Cluster Manager
      1. Examples of Cluster Managers
    2. Spark’s Own Cluster Manager
    3. Understanding Resilience and Fault Tolerance in a Distributed System
      1. Fault Recovery
      2. Cluster Manager Support for Fault Tolerance
    4. Data Delivery Semantics
    5. Microbatching and One-Element-at-a-Time
      1. Microbatching: An Application of Bulk-Synchronous Processing
      2. One-Record-at-a-Time Processing
      3. Microbatching Versus One-at-a-Time: The Trade-Offs
    6. Bringing Microbatch and One-Record-at-a-Time Closer Together
    7. Dynamic Batch Interval
    8. Structured Streaming Processing Model
      1. The Disappearance of the Batch Interval
  9. 6. Spark’s Resilience Model
    1. Resilient Distributed Datasets in Spark
    2. Spark Components
    3. Spark’s Fault-Tolerance Guarantees
      1. Task Failure Recovery
      2. Stage Failure Recovery
      3. Driver Failure Recovery
    4. Summary
  10. A. References for Part I
  11. II. Structured Streaming
  12. 7. Introducing Structured Streaming
    1. First Steps with Structured Streaming
    2. Batch Analytics
    3. Streaming Analytics
      1. Connecting to a Stream
      2. Preparing the Data in the Stream
      3. Operations on Streaming Dataset
      4. Creating a Query
      5. Start the Stream Processing
      6. Exploring the Data
    4. Summary
  13. 8. The Structured Streaming Programming Model
    1. Initializing Spark
    2. Sources: Acquiring Streaming Data
      1. Available Sources
    3. Transforming Streaming Data
      1. Streaming API Restrictions on the DataFrame API
    4. Sinks: Output the Resulting Data
      1. format
      2. outputMode
      3. queryName
      4. option
      5. options
      6. trigger
      7. start()
    5. Summary
  14. 9. Structured Streaming in Action
    1. Consuming a Streaming Source
    2. Application Logic
    3. Writing to a Streaming Sink
    4. Summary
  15. 10. Structured Streaming Sources
    1. Understanding Sources
      1. Reliable Sources Must Be Replayable
      2. Sources Must Provide a Schema
    2. Available Sources
    3. The File Source
      1. Specifying a File Format
      2. Common Options
      3. Common Text Parsing Options (CSV, JSON)
      4. JSON File Source Format
      5. CSV File Source Format
      6. Parquet File Source Format
      7. Text File Source Format
    4. The Kafka Source
      1. Setting Up a Kafka Source
      2. Selecting a Topic Subscription Method
      3. Configuring Kafka Source Options
      4. Kafka Consumer Options
    5. The Socket Source
      1. Configuration
      2. Operations
    6. The Rate Source
      1. Options
  16. 11. Structured Streaming Sinks
    1. Understanding Sinks
    2. Available Sinks
      1. Reliable Sinks
      2. Sinks for Experimentation
      3. The Sink API
      4. Exploring Sinks in Detail
    3. The File Sink
      1. Using Triggers with the File Sink
      2. Common Configuration Options Across All Supported File Formats
      3. Common Time and Date Formatting (CSV, JSON)
      4. The CSV Format of the File Sink
      5. The JSON File Sink Format
      6. The Parquet File Sink Format
      7. The Text File Sink Format
    4. The Kafka Sink
      1. Understanding the Kafka Publish Model
      2. Using the Kafka Sink
    5. The Memory Sink
      1. Output Modes
    6. The Console Sink
      1. Options
      2. Output Modes
    7. The Foreach Sink
      1. The ForeachWriter Interface
      2. TCP Writer Sink: A Practical ForeachWriter Example
      3. The Moral of this Example
      4. Troubleshooting ForeachWriter Serialization Issues
  17. 12. Event Time–Based Stream Processing
    1. Understanding Event Time in Structured Streaming
    2. Using Event Time
    3. Processing Time
    4. Watermarks
    5. Time-Based Window Aggregations
      1. Defining Time-Based Windows
      2. Understanding How Intervals Are Computed
      3. Using Composite Aggregation Keys
      4. Tumbling and Sliding Windows
    6. Record Deduplication
    7. Summary
  18. 13. Advanced Stateful Operations
    1. Example: Car Fleet Management
    2. Understanding Group with State Operations
      1. Internal State Flow
    3. Using MapGroupsWithState
    4. Using FlatMapGroupsWithState
      1. Output Modes
      2. Managing State Over Time
    5. Summary
  19. 14. Monitoring Structured Streaming Applications
    1. The Spark Metrics Subsystem
      1. Structured Streaming Metrics
    2. The StreamingQuery Instance
      1. Getting Metrics with StreamingQueryProgress
    3. The StreamingQueryListener Interface
      1. Implementing a StreamingQueryListener
  20. 15. Experimental Areas: Continuous Processing and Machine Learning
    1. Continuous Processing
      1. Understanding Continuous Processing
      2. Using Continuous Processing
      3. Limitations
    2. Machine Learning
      1. Learning Versus Exploiting
      2. Applying a Machine Learning Model to a Stream
      3. Example: Estimating Room Occupancy by Using Ambient Sensors
      4. Online Training
  21. B. References for Part II
  22. III. Spark Streaming
  23. 16. Introducing Spark Streaming
    1. The DStream Abstraction
      1. DStreams as a Programming Model
      2. DStreams as an Execution Model
    2. The Structure of a Spark Streaming Application
      1. Creating the Spark Streaming Context
      2. Defining a DStream
      3. Defining Output Operations
      4. Starting the Spark Streaming Context
      5. Stopping the Streaming Process
    3. Summary
  24. 17. The Spark Streaming Programming Model
    1. RDDs as the Underlying Abstraction for DStreams
    2. Understanding DStream Transformations
    3. Element-Centric DStream Transformations
    4. RDD-Centric DStream Transformations
    5. Counting
    6. Structure-Changing Transformations
    7. Summary
  25. 18. The Spark Streaming Execution Model
    1. The Bulk-Synchronous Architecture
    2. The Receiver Model
      1. The Receiver API
      2. How Receivers Work
      3. The Receiver’s Data Flow
      4. The Internal Data Resilience
      5. Receiver Parallelism
      6. Balancing Resources: Receivers Versus Processing Cores
      7. Achieving Zero Data Loss with the Write-Ahead Log
    3. The Receiverless or Direct Model
    4. Summary
  26. 19. Spark Streaming Sources
    1. Types of Sources
      1. Basic Sources
      2. Receiver-Based Sources
      3. Direct Sources
    2. Commonly Used Sources
    3. The File Source
      1. How It Works
    4. The Queue Source
      1. How It Works
      2. Using a Queue Source for Unit Testing
      3. A Simpler Alternative to the Queue Source: The ConstantInputDStream
    5. The Socket Source
      1. How It Works
    6. The Kafka Source
      1. Using the Kafka Source
      2. How It Works
    7. Where to Find More Sources
  27. 20. Spark Streaming Sinks
    1. Output Operations
    2. Built-In Output Operations
      1. print
      2. saveAsxyz
      3. foreachRDD
    3. Using foreachRDD as a Programmable Sink
    4. Third-Party Output Operations
  28. 21. Time-Based Stream Processing
    1. Window Aggregations
    2. Tumbling Windows
      1. Window Length Versus Batch Interval
    3. Sliding Windows
      1. Sliding Windows Versus Batch Interval
      2. Sliding Windows Versus Tumbling Windows
    4. Using Windows Versus Longer Batch Intervals
    5. Window Reductions
      1. reduceByWindow
      2. reduceByKeyAndWindow
      3. countByWindow
      4. countByValueAndWindow
    6. Invertible Window Aggregations
    7. Slicing Streams
    8. Summary
  29. 22. Arbitrary Stateful Streaming Computation
    1. Statefulness at the Scale of a Stream
    2. updateStateByKey
    3. Limitation of updateStateByKey
      1. Performance
      2. Memory Usage
    4. Introducing Stateful Computation with mapwithState
    5. Using mapWithState
    6. Event-Time Stream Computation Using mapWithState
  30. 23. Working with Spark SQL
    1. Spark SQL
    2. Accessing Spark SQL Functions from Spark Streaming
      1. Example: Writing Streaming Data to Parquet
    3. Dealing with Data at Rest
      1. Using Join to Enrich the Input Stream
    4. Join Optimizations
    5. Updating Reference Datasets in a Streaming Application
      1. Enhancing Our Example with a Reference Dataset
    6. Summary
  31. 24. Checkpointing
    1. Understanding the Use of Checkpoints
    2. Checkpointing DStreams
    3. Recovery from a Checkpoint
      1. Limitations
    4. The Cost of Checkpointing
    5. Checkpoint Tuning
  32. 25. Monitoring Spark Streaming
    1. The Streaming UI
    2. Understanding Job Performance Using the Streaming UI
      1. Input Rate Chart
      2. Scheduling Delay Chart
      3. Processing Time Chart
      4. Total Delay Chart
      5. Batch Details
    3. The Monitoring REST API
      1. Using the Monitoring REST API
      2. Information Exposed by the Monitoring REST API
    4. The Metrics Subsystem
    5. The Internal Event Bus
      1. Interacting with the Event Bus
    6. Summary
  33. 26. Performance Tuning
    1. The Performance Balance of Spark Streaming
      1. The Relationship Between Batch Interval and Processing Delay
      2. The Last Moments of a Failing Job
      3. Going Deeper: Scheduling Delay and Processing Delay
      4. Checkpoint Influence in Processing Time
    2. External Factors that Influence the Job’s Performance
    3. How to Improve Performance?
    4. Tweaking the Batch Interval
    5. Limiting the Data Ingress with Fixed-Rate Throttling
    6. Backpressure
    7. Dynamic Throttling
      1. Tuning the Backpressure PID
      2. Custom Rate Estimator
      3. A Note on Alternative Dynamic Handling Strategies
    8. Caching
    9. Speculative Execution
  34. C. References for Part III
  35. IV. Advanced Spark Streaming Techniques
  36. 27. Streaming Approximation and Sampling Algorithms
    1. Exactness, Real Time, and Big Data
      1. Exactness
      2. Real-Time Processing
      3. Big Data
    2. The Exactness, Real-Time, and Big Data triangle
      1. Big Data and Real Time
    3. Approximation Algorithms
    4. Hashing and Sketching: An Introduction
    5. Counting Distinct Elements: HyperLogLog
      1. Role-Playing Exercise: If We Were a System Administrator
      2. Practical HyperLogLog in Spark
    6. Counting Element Frequency: Count Min Sketches
      1. Introducing Bloom Filters
      2. Bloom Filters with Spark
      3. Computing Frequencies with a Count-Min Sketch
    7. Ranks and Quantiles: T-Digest
      1. T-Digest in Spark
    8. Reducing the Number of Elements: Sampling
      1. Random Sampling
      2. Stratified Sampling
  37. 28. Real-Time Machine Learning
    1. Streaming Classification with Naive Bayes
      1. streamDM Introduction
      2. Naive Bayes in Practice
      3. Training a Movie Review Classifier
    2. Introducing Decision Trees
    3. Hoeffding Trees
      1. Hoeffding Trees in Spark, in Practice
    4. Streaming Clustering with Online K-Means
      1. K-Means Clustering
      2. Online Data and K-Means
      3. The Problem of Decaying Clusters
      4. Streaming K-Means with Spark Streaming
  38. D. References for Part IV
  39. V. Beyond Apache Spark
  40. 29. Other Distributed Real-Time Stream Processing Systems
    1. Apache Storm
      1. Processing Model
      2. The Storm Topology
      3. The Storm Cluster
      4. Compared to Spark
    2. Apache Flink
      1. A Streaming-First Framework
      2. Compared to Spark
    3. Kafka Streams
      1. Kafka Streams Programming Model
      2. Compared to Spark
    4. In the Cloud
      1. Amazon Kinesis on AWS
      2. Microsoft Azure Stream Analytics
      3. Apache Beam/Google Cloud Dataflow
  41. 30. Looking Ahead
    1. Stay Plugged In
      1. Seek Help on Stack Overflow
      2. Start Discussions on the Mailing Lists
      3. Attend Conferences
    2. Attend Meetups
      1. Read Books
    3. Contributing to the Apache Spark Project
  42. E. References for Part V
  43. Index