Book description
A friendly, framework-agnostic tutorial that will help you grok how streaming systems work, and how to build your own!
In Grokking Streaming Systems you will learn how to:
- Implement and troubleshoot streaming systems
- Design streaming systems for complex functionalities
- Assess parallelization requirements
- Spot networking bottlenecks and resolve backpressure
- Group data for high-performance systems
- Handle delayed events in real-time systems
Grokking Streaming Systems is a simple guide to the complex concepts behind streaming systems. This friendly and framework-agnostic tutorial teaches you how to handle real-time events, and even design and build your own streaming job that’s a perfect fit for your needs. Each new idea is carefully explained with diagrams, clear examples, and fun dialogue between perplexed personalities!
About the Technology
Streaming systems minimize the time between receiving and processing event data, so they can deliver responses in real time. For applications in finance, security, and IoT where milliseconds matter, streaming systems are a requirement. And streaming is hot! Skills with platforms like Spark, Heron, and Kafka are in high demand.
About the Book
Grokking Streaming Systems introduces real-time event streaming applications in clear, reader-friendly language. This engaging book illuminates core concepts like data parallelization, event windows, and backpressure without getting bogged down in framework-specific details. As you go, you’ll build your own simple streaming tool from the ground up to make sure all the ideas and techniques stick. The helpful and entertaining illustrations make streaming systems come alive as you tackle relevant examples like real-time credit card fraud detection and monitoring IoT services.
What's Inside
- Implement and troubleshoot streaming systems
- Design streaming systems for complex functionalities
- Spot networking bottlenecks and resolve backpressure
- Group data for high-performance systems
About the Reader
No prior experience with streaming systems is assumed. Examples in Java.
About the Authors
Josh Fischer and Ning Wang are Apache Committers, and part of the committee for the Apache Heron distributed stream processing engine.
Quotes
Very well-written and enjoyable. I recommend this book to all software engineers working on data processing.
- Apoorv Gupta, Facebook
Finally, a much-needed introduction to streaming systems—a must-read for anyone interested in this technology.
- Anupam Sengupta, Red Hat
Tackles complex topics in a very approachable manner.
- Marc Roulleau, GIRO
A superb resource for helping you grasp the fundamentals of open-source streaming systems.
- Simon Verhoeven, Cronos
Explains all the main streaming concepts in a friendly way. Start with this one!
- Cicero Zandona, Calypso Technologies
Table of contents
- inside front cover
- Grokking Streaming Systems
- Copyright
- brief contents
- contents
- front matter
- Part 1. Getting started with streaming
- 1 Welcome to Grokking Streaming Systems
- What is stream processing?
- Streaming system examples
- Streaming systems and real time
- How a streaming system works
- Applications
- Backend services
- Inside a backend service
- Batch processing systems
- Inside a batch processing system
- Stream processing systems
- Inside a stream processing system
- The advantages of multi-stage architecture
- The multi-stage architecture in batch and stream processing systems
- Compare the systems
- A model stream processing system
- Summary
- Exercise
- 2 Hello, streaming systems!
- The chief needs a fancy tollbooth
- It started as HTTP requests, and it failed
- AJ and Miranda take time to reflect
- AJ ponders about streaming systems
- Comparing backend service and streaming
- How a streaming system could fit
- Queues: A foundational concept
- Data transfer via queues
- Our streaming framework (the start of it)
- The Streamwork framework overview
- Zooming in on the Streamwork engine
- Core streaming concepts
- More details of the concepts
- The streaming job execution flow
- Your first streaming job
- Executing the job
- Inspecting the job execution
- Look inside the engine
- Keep events moving
- The life of a data element
- Reviewing streaming concepts
- Summary
- Exercises
- 3 Parallelization and data grouping
- The sensor is emitting more events
- Even in streaming, real time is hard
- New concepts: Parallelism is important
- New concepts: Data parallelism
- New concepts: Data execution independence
- New concepts: Task parallelism
- Data parallelism vs. task parallelism
- Parallelism and concurrency
- Parallelizing the job
- Parallelizing components
- Parallelizing sources
- Viewing job output
- Parallelizing operators
- Viewing job output
- Events and instances
- Event ordering
- Event grouping
- Shuffle grouping
- Shuffle grouping: Under the hood
- Fields grouping
- Fields grouping: Under the hood
- Event grouping execution
- Look inside the engine: Event dispatcher
- Applying fields grouping in your job
- Event ordering
- Comparing grouping behaviors
- Summary
- Exercises
- 4 Stream graph
- A credit card fraud detection system
- More about the credit card fraud detection system
- The fraud detection business
- Streaming isn’t always a straight line
- Zoom into the system
- The fraud detection job in detail
- New concepts
- Upstream and downstream components
- Stream fan-out and fan-in
- Graph, directed graph, and DAG
- DAG in stream processing systems
- All new concepts in one page
- Stream fan-out to the analyzers
- Look inside the engine
- There is a problem: Efficiency
- Stream fan-out with different streams
- Look inside the engine again
- Communication between the components via channels
- Multiple channels
- Stream fan-in to the score aggregator
- Stream fan-in in the engine
- A brief introduction to another stream fan-in: Join
- Look at the whole system
- Graph and streaming jobs
- The example systems
- Summary
- Exercises
- 5 Delivery semantics
- The latency requirement of the fraud detection system
- Revisit the fraud detection job
- About accuracy
- Partial result
- A new streaming job to monitor system usage
- The new system usage job
- The requirements of the new system usage job
- New concepts: (The number of) times delivered and times processed
- New concept: Delivery semantics
- Choosing the right semantics
- At-most-once
- The fraud detection job
- At-least-once
- At-least-once with acknowledging
- Track events
- Handle event processing failures
- Track early out events
- Acknowledging code in components
- New concept: Checkpointing
- New concept: State
- Checkpointing in the system usage job for the at-least-once semantic
- Checkpointing and state manipulation functions
- State handling code in the transaction source component
- Exactly-once or effectively-once?
- Bonus concept: Idempotent operation
- Exactly-once, finally
- State handling code in the system usage analyzer component
- Comparing the delivery semantics again
- Summary
- Exercises
- Up next ...
- 6 Streaming systems review and a glimpse ahead
- Streaming system pieces
- Parallelization and event grouping
- DAGs and streaming jobs
- Delivery semantics (guarantees)
- Delivery semantics used in the credit card fraud detection system
- Which way to go from here
- Windowed computations
- Joining data in real time
- Backpressure
- Stateless and stateful computations
- Part 2. Stepping up
- 7 Windowed computations
- Slicing up real-time data
- Breaking down the problem in detail
- Breaking down the problem in detail (continued)
- Two different contexts
- Windowing in the fraud detection job
- What exactly are windows?
- Looking closer into the window
- New concept: Windowing strategy
- Fixed windows
- Fixed windows in the windowed proximity analyzer
- Detecting fraud with a fixed time window
- Fixed windows: Time vs. count
- Sliding windows
- Sliding windows: Windowed proximity analyzer
- Detecting fraud with a sliding window
- Session windows
- Session windows (continued)
- Detecting fraud with session windows
- Summary of windowing strategies
- Slicing an event stream into data sets
- Windowing: Concept or implementation
- Another look
- Key–value store 101
- Implement the windowed proximity analyzer
- Event time and other times for events
- Windowing watermark
- Late events
- Summary
- Exercise
- 8 Join operations
- Joining emission data on the fly
- The emissions job version 1
- The emission resolver
- Accuracy becomes an issue
- The enhanced emissions job
- Focusing on the join
- What is a join again?
- How the stream join works
- Stream join is a different kind of fan-in
- Vehicle events vs. temperature events
- Table: A materialized view of streaming
- Vehicle events are less efficient to be materialized
- Data integrity quickly became an issue
- What’s the problem with this join operator?
- Inner join
- Outer join
- The inner join vs. outer join
- Different types of joins
- Outer joins in streaming systems
- A new issue: Weak connection
- Windowed joins
- Joining two tables instead of joining a stream and table
- Revisiting the materialized view
- Summary
- 9 Backpressure
- Reliability is critical
- Review the system
- Streamlining streaming jobs
- New concepts: Capacity, utilization, and headroom
- More about utilization and headroom
- New concept: Backpressure
- Measure capacity utilization
- Backpressure in the Streamwork engine
- Backpressure in the Streamwork engine: Propagation
- Our streaming job during a backpressure
- Backpressure in distributed systems
- New concept: Backpressure watermarks
- Another approach to handle lagging instances: Dropping events
- Why do we want to drop events?
- Backpressure could be a symptom when the underlying issue is permanent
- Stopping and resuming may lead to thrashing if the issue is permanent
- Handle thrashing
- Summary
- 10 Stateful computation
- The migration of the streaming jobs
- Stateful components in the system usage job
- Revisit: State
- The states in different components
- State data vs. temporary data
- Stateful vs. stateless components: The code
- The stateful source and operator in the system usage job
- States and checkpoints
- Checkpoint creation: Timing is hard
- Event-based timing
- Creating checkpoints with checkpoint events
- A checkpoint event is handled by instance executors
- A checkpoint event flowing through a job
- Creating checkpoints with checkpoint events at the instance level
- Checkpoint event synchronization
- Checkpoint loading and backward compatibility
- Checkpoint storage
- Stateful vs. stateless components
- Manually managed instance states
- Lambda architecture
- Summary
- Exercises
- 11 Wrap-up: Advanced concepts in streaming systems
- Is this really the end?
- Windowed computations
- The major window types
- Joining data in real time
- SQL vs. stream joins
- Inner joins vs. outer joins
- Unexpected things can happen in streaming systems
- Backpressure: Slow down sources or upstream components
- Another approach to handle lagging instances: Dropping events
- Backpressure can be a symptom when the underlying issue is permanent
- Stateful components with checkpoints
- Event-based timing
- Stateful vs. stateless components
- You did it!
- Appendix. Key concepts covered in this book
- index
Product information
- Title: Grokking Streaming Systems
- Author(s): Josh Fischer, Ning Wang
- Release date: March 2022
- Publisher(s): Manning Publications
- ISBN: 9781617297304