Book description
Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs.
Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API.
- Learn fundamental stream processing concepts and examine different streaming architectures
- Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail
- Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs
- Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms
- Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams
Table of contents
- Foreword
- Preface
- I. Fundamentals of Stream Processing with Apache Spark
- 1. Introducing Stream Processing
- 2. Stream-Processing Model
- 3. Streaming Architectures
- 4. Apache Spark as a Stream-Processing Engine
- 5. Spark’s Distributed Processing Model
  - Running Apache Spark with a Cluster Manager
  - Spark’s Own Cluster Manager
  - Understanding Resilience and Fault Tolerance in a Distributed System
  - Data Delivery Semantics
  - Microbatching and One-Element-at-a-Time
  - Bringing Microbatch and One-Record-at-a-Time Closer Together
  - Dynamic Batch Interval
  - Structured Streaming Processing Model
- 6. Spark’s Resilience Model
- A. References for Part I
- II. Structured Streaming
- 7. Introducing Structured Streaming
- 8. The Structured Streaming Programming Model
- 9. Structured Streaming in Action
- 10. Structured Streaming Sources
- 11. Structured Streaming Sinks
- 12. Event Time–Based Stream Processing
- 13. Advanced Stateful Operations
- 14. Monitoring Structured Streaming Applications
- 15. Experimental Areas: Continuous Processing and Machine Learning
- B. References for Part II
- III. Spark Streaming
- 16. Introducing Spark Streaming
- 17. The Spark Streaming Programming Model
- 18. The Spark Streaming Execution Model
- 19. Spark Streaming Sources
- 20. Spark Streaming Sinks
- 21. Time-Based Stream Processing
- 22. Arbitrary Stateful Streaming Computation
- 23. Working with Spark SQL
- 24. Checkpointing
- 25. Monitoring Spark Streaming
- 26. Performance Tuning
- C. References for Part III
- IV. Advanced Spark Streaming Techniques
- 27. Streaming Approximation and Sampling Algorithms
  - Exactness, Real Time, and Big Data
  - The Exactness, Real-Time, and Big Data triangle
  - Approximation Algorithms
  - Hashing and Sketching: An Introduction
  - Counting Distinct Elements: HyperLogLog
  - Counting Element Frequency: Count Min Sketches
  - Ranks and Quantiles: T-Digest
  - Reducing the Number of Elements: Sampling
- 28. Real-Time Machine Learning
- D. References for Part IV
- V. Beyond Apache Spark
- 29. Other Distributed Real-Time Stream Processing Systems
- 30. Looking Ahead
- E. References for Part V
- Index
Product information
- Title: Stream Processing with Apache Spark
- Author(s): Gerard Maas, François Garillot
- Release date: June 2019
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491944240