O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Stream Processing with Apache Spark

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles. You’ll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

To build analytics tools that provide faster insights, knowing how to process data in real time is a must, and moving from batch processing to stream processing is absolutely required. Fortunately, the Spark in-memory framework/platform for processing data has added an extension devoted to fault-tolerant stream processing: Spark Streaming.

If you're familiar with Apache Spark and want to learn how to implement it for streaming jobs, this practical book is a must.

  • Understand how Spark Streaming fits in the big picture
  • Learn core concepts such as Spark RDDs, Spark Streaming clusters, and the fundamentals of a DStream
  • Discover how to create a robust deployment
  • Dive into streaming algorithmics
  • Learn how to tune, measure, and monitor Spark Streaming

Table of Contents

  1. Preface
    1. Why we wrote this book
    2. How’s this book for?
    3. Installing Spark
    4. Learning Scala
    5. The Way Ahead
    6. Bibliography
  2. I. Fundamentals of Stream Processing with Apache Spark
    1. 1. Introducing Stream Processing
      1. What is Stream Processing?
        1. Batch vs. Stream Processing
        2. The Notion of Time in Stream Processing
        3. The Factor of Uncertainty
      2. Some Examples of Stream Processing
      3. Scaling Up Data Processing
        1. MapReduce
        2. The Lesson Learned: Scalability and Fault-Tolerance
      4. Distributed Stream Processing
        1. Stateful Stream Processing in a Distributed System
      5. Introducing Apache Spark
        1. The First Wave: Functional APIs
        2. The Second Wave: SQL
        3. A Unified Engine
        4. Spark Components
        5. Spark Streaming
        6. Structured Streaming
      6. Where Next?
    2. 2. Stream Processing Model
      1. Sources and Sinks
      2. Immutable Streams Defined From Each Other
      3. Transformations and Aggregations
      4. Window Aggregations
        1. Tumbling Windows
        2. Sliding Windows
      5. Stateless and Stateful processing
      6. Stateful Streams
      7. An example: Local Stateful Computation in Scala
        1. A stateless definition of the Fibonacci sequence as a stream transformation
        2. A stateful definition of the Fibonacci sequence
      8. The Effect of Time
        1. Computing on time-stamped events
        2. Timestamps as the providers of the notion of time
        3. Event Time vs Processing Time
        4. Computing with a watermark
      9. Summary
    3. 3. Distributed Stream Processing
      1. Spark’s own cluster manager
      2. Understanding Resilience and Fault Tolerance in a Distributed System
        1. Fault Recovery
        2. Cluster Manager Support for Fault Tolerance
      3. Delivery guarantees
      4. Micro-batching and one-element-at-a time
    4. 4. Streaming Architectures
      1. The use of a batch processing component in a streaming application
      2. The Lambda and Kappa Architectures, or how to compare streaming applications
      3. Streaming algorithms are sometimes completely different in nature
      4. Streaming algorithms can’t be guaranteed to measure well against batch algorithms
    5. 5. Apache Spark
    6. 6. Spark in Practice: How to leverage Apache’s Spark rock-solid fault tolerance for stream processing
      1. Thinking about reliability: Closures
      2. Spark’s Reliability primitives
      3. Spark’s Fault Tolerance Guarantees
        1. Task failure
        2. The External shuffle service
        3. Driver failure
        4. Cluster-mode deployment
        5. Checkpointing
        6. A hot-swappable master through Zookeeper
      4. Fault-tolerance in Spark Streaming: the context of the Receiver model
      5. Spark Streaming’s Zero Data Loss guarantees
      6. Cluster managers and driver restart
      7. Comparing cluster managers
      8. Bibliography
  3. II. Structured Streaming
    1. 7. Introducing Structured Streaming
      1. First Steps with Structured Streaming
      2. Batch Analytics
      3. Streaming Analytics
        1. Connecting to a Stream
        2. Preparing the Data in the Stream
        3. Operations on Streaming Dataset
        4. Creating a Query
        5. Start the Stream Processing
        6. Exploring the Data
      4. Summary
    2. 8. The Structured Streaming Programming Model
      1. Initializing Spark
      2. Sources: acquiring streaming data
        1. Available Sources
      3. Transforming Streaming Data
        1. Streaming API Restrictions
      4. Sinks: output the resulting data
        1. format
        2. outputMode
        3. queryName
        4. option
        5. options
        6. trigger
      5. Summary
    3. 9. Structured Streaming in Action
      1. Consuming a Streaming Source
      2. Application Logic
      3. Writing to a Streaming Sink
      4. Summary
    4. 10. Structured Streaming Sources
      1. Understanding Sources
        1. Sources must be replayable
        2. Sources must provide a schema
      2. Available Sources
      3. The File Source
        1. Specifying a File Format
        2. Common Options
        3. Common Text Parsing Options (CSV, JSON)
        4. CSV File Source Format
        5. Parquet File Source Format
        6. Text File Source Format
      4. The Kafka Source
        1. Setting up a Kafka Source
        2. Selecting a Topic Subscription Method
        3. Configuring Kafka Source Options
        4. Kafka Consumer Options
      5. The Socket Source
        1. Configuration
        2. Operations
      6. The Rate Source
        1. Options
    5. 11. Structured Streaming Sinks
      1. Understanding Sinks
      2. Available Sinks
        1. Reliable Sinks
        2. Sinks for Experimentation
        3. The Sink API
        4. Exploring Sinks in Detail
      3. The File Sink
        1. Using Triggers with the File Sink
        2. Common Configuration Options Across All Supported File Formats
        3. Common Time and Date Formatting (CSV, JSON)
        4. The CSV Format of the File Sink
        5. The JSON File Sink Format
        6. The Parquet File Sink Format
        7. The Text File Sink Format
      4. The Kafka Sink
        1. Understanding Kafka Publish model
        2. Using the Kafka Sink
      5. The Memory Sink
        1. Output Modes
      6. The Console Sink
        1. Options
        2. Output Modes
      7. The Foreach Sink
        1. The ForeachWriter Interface
        2. TCP Writer Sink: A Practical ForeachWriter example
        3. The Moral of this Example
        4. Troubleshooting ForeachWriter Serialization Issues
    6. 12. Event Time Based Stream Processing
      1. Understanding Event Time in Structured Streaming
      2. Using Event Time
      3. Processing Time
      4. Watermarks
      5. Time-based Window Aggregations
        1. Defining Time-based Windows
        2. Understanding How Intervals are Computed.
        3. Using Composite Aggregation Keys
        4. Tumbling and Sliding Windows
    7. 13. Advanced Stateful Operations
      1. Starting with an Example
      2. Understanding group with state operations
        1. Internal State Flow
      3. Using MapGroupsWithState
      4. Using FlatMapGroupsWithState
        1. Output Modes
        2. Managing Timeouts
      5. Summary
    8. 14. Monitoring Structured Streaming Applications
      1. The Spark’s Metrics Subsystem
        1. Structured Streaming Metrics
      2. The StreamingQuery Instance
        1. Getting Metrics with StreamingQueryProgress
      3. The StreamingQueryListener Interface
        1. Implementing a StreamingQueryListener
    9. 15. Experimental Areas: Continuous Processing and Machine Learning
      1. Continous Processing
        1. Understanding Continuous Processing
        2. Using Continuous Processing
      2. Machine Learning
        1. Learning vs Exploiting
        2. Applying a ML Model to a Stream
        3. Online Training