Stream Processing with Apache Spark

Publisher(s): O'Reilly Media, Inc.

Foreword
Preface
1. Who Should Read This Book?
2. Installing Spark
3. Learning Scala
4. The Way Ahead
5. Bibliography
6. Conventions Used in This Book
7. Using Code Examples
8. O’Reilly Online Learning
9. How to Contact Us
10. Acknowledgments
  1. From Gerard
  2. From François
I. Fundamentals of Stream Processing with Apache Spark
1. Introducing Stream Processing
1. What Is Stream Processing?
2. Some Examples of Stream Processing
3. Scaling Up Data Processing
  1. MapReduce
  2. The Lesson Learned: Scalability and Fault Tolerance
4. Distributed Stream Processing
  1. Stateful Stream Processing in a Distributed System
5. Introducing Apache Spark
6. Where Next?
2. Stream-Processing Model
1. Sources and Sinks
2. Immutable Streams Defined from One Another
3. Transformations and Aggregations
4. Window Aggregations
  1. Tumbling Windows
  2. Sliding Windows
5. Stateless and Stateful Processing
6. Stateful Streams
7. An Example: Local Stateful Computation in Scala
  1. A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
8. Stateless or Stateful Streaming
9. The Effect of Time
10. Summary
3. Streaming Architectures
1. Components of a Data Platform
2. Architectural Models
3. The Use of a Batch-Processing Component in a Streaming Application
4. Referential Streaming Architectures
  1. The Lambda Architecture
  2. The Kappa Architecture
5. Streaming Versus Batch Algorithms
  1. Streaming Algorithms Are Sometimes Completely Different in Nature
  2. Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
6. Summary
4. Apache Spark as a Stream-Processing Engine
1. The Tale of Two APIs
2. Spark’s Memory Usage
3. Understanding Latency
4. Throughput-Oriented Processing
5. Spark’s Polyglot API
6. Fast Implementation of Data Analysis
7. To Learn More About Spark
8. Summary
5. Spark’s Distributed Processing Model
1. Running Apache Spark with a Cluster Manager
  1. Examples of Cluster Managers
2. Spark’s Own Cluster Manager
3. Understanding Resilience and Fault Tolerance in a Distributed System
  1. Fault Recovery
  2. Cluster Manager Support for Fault Tolerance
4. Data Delivery Semantics
5. Microbatching and One-Element-at-a-Time
6. Bringing Microbatch and One-Record-at-a-Time Closer Together
7. Dynamic Batch Interval
8. Structured Streaming Processing Model
  1. The Disappearance of the Batch Interval
6. Spark’s Resilience Model
1. Resilient Distributed Datasets in Spark
2. Spark Components
3. Spark’s Fault-Tolerance Guarantees
4. Summary
A. References for Part I
II. Structured Streaming
7. Introducing Structured Streaming
1. First Steps with Structured Streaming
2. Batch Analytics
3. Streaming Analytics
4. Summary
8. The Structured Streaming Programming Model
1. Initializing Spark
2. Sources: Acquiring Streaming Data
  1. Available Sources
3. Transforming Streaming Data
  1. Streaming API Restrictions on the DataFrame API
4. Sinks: Output the Resulting Data
  1. format
  2. outputMode
  3. queryName
  4. option
  5. options
  6. trigger
  7. start()
5. Summary
9. Structured Streaming in Action
1. Consuming a Streaming Source
2. Application Logic
3. Writing to a Streaming Sink
4. Summary
10. Structured Streaming Sources
1. Understanding Sources
  1. Reliable Sources Must Be Replayable
  2. Sources Must Provide a Schema
2. Available Sources
3. The File Source
4. The Kafka Source
5. The Socket Source
  1. Configuration
  2. Operations
6. The Rate Source
  1. Options
11. Structured Streaming Sinks
1. Understanding Sinks
2. Available Sinks
3. The File Sink
4. The Kafka Sink
  1. Understanding the Kafka Publish Model
  2. Using the Kafka Sink
5. The Memory Sink
  1. Output Modes
6. The Console Sink
  1. Options
  2. Output Modes
7. The Foreach Sink
12. Event Time–Based Stream Processing
1. Understanding Event Time in Structured Streaming
2. Using Event Time
3. Processing Time
4. Watermarks
5. Time-Based Window Aggregations
6. Record Deduplication
7. Summary
13. Advanced Stateful Operations
1. Example: Car Fleet Management
2. Understanding Group with State Operations
  1. Internal State Flow
3. Using MapGroupsWithState
4. Using FlatMapGroupsWithState
  1. Output Modes
  2. Managing State Over Time
5. Summary
14. Monitoring Structured Streaming Applications
1. The Spark Metrics Subsystem
  1. Structured Streaming Metrics
2. The StreamingQuery Instance
  1. Getting Metrics with StreamingQueryProgress
3. The StreamingQueryListener Interface
  1. Implementing a StreamingQueryListener
15. Experimental Areas: Continuous Processing and Machine Learning
1. Continuous Processing
2. Machine Learning
B. References for Part II
III. Spark Streaming
16. Introducing Spark Streaming
1. The DStream Abstraction
  1. DStreams as a Programming Model
  2. DStreams as an Execution Model
2. The Structure of a Spark Streaming Application
3. Summary
17. The Spark Streaming Programming Model
1. RDDs as the Underlying Abstraction for DStreams
2. Understanding DStream Transformations
3. Element-Centric DStream Transformations
4. RDD-Centric DStream Transformations
5. Counting
6. Structure-Changing Transformations
7. Summary
18. The Spark Streaming Execution Model
1. The Bulk-Synchronous Architecture
2. The Receiver Model
3. The Receiverless or Direct Model
4. Summary
19. Spark Streaming Sources
1. Types of Sources
2. Commonly Used Sources
3. The File Source
  1. How It Works
4. The Queue Source
5. The Socket Source
  1. How It Works
6. The Kafka Source
  1. Using the Kafka Source
  2. How It Works
7. Where to Find More Sources
20. Spark Streaming Sinks
1. Output Operations
2. Built-In Output Operations
3. Using foreachRDD as a Programmable Sink
4. Third-Party Output Operations
21. Time-Based Stream Processing
1. Window Aggregations
2. Tumbling Windows
  1. Window Length Versus Batch Interval
3. Sliding Windows
  1. Sliding Windows Versus Batch Interval
  2. Sliding Windows Versus Tumbling Windows
4. Using Windows Versus Longer Batch Intervals
5. Window Reductions
6. Invertible Window Aggregations
7. Slicing Streams
8. Summary
22. Arbitrary Stateful Streaming Computation
1. Statefulness at the Scale of a Stream
2. updateStateByKey
3. Limitation of updateStateByKey
  1. Performance
  2. Memory Usage
4. Introducing Stateful Computation with mapWithState
5. Using mapWithState
6. Event-Time Stream Computation Using mapWithState
23. Working with Spark SQL
1. Spark SQL
2. Accessing Spark SQL Functions from Spark Streaming
  1. Example: Writing Streaming Data to Parquet
3. Dealing with Data at Rest
  1. Using Join to Enrich the Input Stream
4. Join Optimizations
5. Updating Reference Datasets in a Streaming Application
  1. Enhancing Our Example with a Reference Dataset
6. Summary
24. Checkpointing
1. Understanding the Use of Checkpoints
2. Checkpointing DStreams
3. Recovery from a Checkpoint
  1. Limitations
4. The Cost of Checkpointing
5. Checkpoint Tuning
25. Monitoring Spark Streaming
1. The Streaming UI
2. Understanding Job Performance Using the Streaming UI
3. The Monitoring REST API
  1. Using the Monitoring REST API
  2. Information Exposed by the Monitoring REST API
4. The Metrics Subsystem
5. The Internal Event Bus
  1. Interacting with the Event Bus
6. Summary
26. Performance Tuning
1. The Performance Balance of Spark Streaming
2. External Factors that Influence the Job’s Performance
3. How to Improve Performance?
4. Tweaking the Batch Interval
5. Limiting the Data Ingress with Fixed-Rate Throttling
6. Backpressure
7. Dynamic Throttling
8. Caching
9. Speculative Execution
C. References for Part III
IV. Advanced Spark Streaming Techniques
27. Streaming Approximation and Sampling Algorithms
1. Exactness, Real Time, and Big Data
2. The Exactness, Real-Time, and Big Data Triangle
  1. Big Data and Real Time
3. Approximation Algorithms
4. Hashing and Sketching: An Introduction
5. Counting Distinct Elements: HyperLogLog
  1. Role-Playing Exercise: If We Were a System Administrator
  2. Practical HyperLogLog in Spark
6. Counting Element Frequency: Count Min Sketches
7. Ranks and Quantiles: T-Digest
  1. T-Digest in Spark
8. Reducing the Number of Elements: Sampling
  1. Random Sampling
  2. Stratified Sampling
28. Real-Time Machine Learning
1. Streaming Classification with Naive Bayes
2. Introducing Decision Trees
3. Hoeffding Trees
  1. Hoeffding Trees in Spark, in Practice
4. Streaming Clustering with Online K-Means
D. References for Part IV
V. Beyond Apache Spark
29. Other Distributed Real-Time Stream Processing Systems
1. Apache Storm
2. Apache Flink
  1. A Streaming-First Framework
  2. Compared to Spark
3. Kafka Streams
  1. Kafka Streams Programming Model
  2. Compared to Spark
4. In the Cloud
30. Looking Ahead
1. Stay Plugged In
2. Attend Meetups
  1. Read Books
3. Contributing to the Apache Spark Project
E. References for Part V
Index
