Grokking Streaming Systems

Book description

A friendly, framework-agnostic tutorial that will help you grok how streaming systems work—and how to build your own!

In Grokking Streaming Systems you will learn how to:

  • Implement and troubleshoot streaming systems
  • Design streaming systems for complex functionalities
  • Assess parallelization requirements
  • Spot networking bottlenecks and resolve back pressure
  • Group data for high-performance systems
  • Handle delayed events in real-time systems

Grokking Streaming Systems is a simple guide to the complex concepts behind streaming systems. This friendly and framework-agnostic tutorial teaches you how to handle real-time events, and even design and build your own streaming job that’s a perfect fit for your needs. Each new idea is carefully explained with diagrams, clear examples, and fun dialogue between perplexed personalities!

About the Technology
Streaming systems minimize the time between receiving and processing event data, so they can deliver responses in real time. For applications in finance, security, and IoT where milliseconds matter, streaming systems are a requirement. And streaming is hot! Skills on platforms like Spark, Heron, and Kafka are in high demand.

About the Book
Grokking Streaming Systems introduces real-time event streaming applications in clear, reader-friendly language. This engaging book illuminates core concepts like data parallelization, event windows, and backpressure without getting bogged down in framework-specific details. As you go, you’ll build your own simple streaming tool from the ground up to make sure all the ideas and techniques stick. The helpful and entertaining illustrations make streaming systems come alive as you tackle relevant examples like real-time credit card fraud detection and monitoring IoT services.

What's Inside
  • Implement and troubleshoot streaming systems
  • Design streaming systems for complex functionalities
  • Spot networking bottlenecks and resolve backpressure
  • Group data for high-performance systems


About the Reader
No prior experience with streaming systems is assumed. Examples in Java.

About the Authors
Josh Fischer and Ning Wang are Apache Committers, and part of the committee for the Apache Heron distributed stream processing engine.

Quotes
Very well-written and enjoyable. I recommend this book to all software engineers working on data processing.
- Apoorv Gupta, Facebook

Finally, a much-needed introduction to streaming systems—a must-read for anyone interested in this technology.
- Anupam Sengupta, Red Hat

Tackles complex topics in a very approachable manner.
- Marc Roulleau, GIRO

A superb resource for helping you grasp the fundamentals of open-source streaming systems.
- Simon Verhoeven, Cronos

Explains all the main streaming concepts in a friendly way. Start with this one!
- Cicero Zandona, Calypso Technologies

Table of contents

  1. inside front cover
  2. Grokking Streaming Systems
  3. Copyright
  4. brief contents
  5. contents
  6. front matter
    1. preface
    2. acknowledgments
    3. about this book
    4. about the authors
  7. Part 1. Getting started with streaming
  8. 1 Welcome to Grokking Streaming Systems
    1. What is stream processing?
    2. Streaming system examples
    3. Streaming systems and real time
    4. How a streaming system works
    5. Applications
    6. Backend services
    7. Inside a backend service
    8. Batch processing systems
    9. Inside a batch processing system
    10. Stream processing systems
    11. Inside a stream processing system
    12. The advantages of multi-stage architecture
    13. The multi-stage architecture in batch and stream processing systems
    14. Compare the systems
    15. A model stream processing system
    16. Summary
    17. Exercise
  9. 2 Hello, streaming systems!
    1. The chief needs a fancy tollbooth
    2. It started as HTTP requests, and it failed
    3. AJ and Miranda take time to reflect
    4. AJ ponders about streaming systems
    5. Comparing backend service and streaming
    6. How a streaming system could fit
    7. Queues: A foundational concept
    8. Data transfer via queues
    9. Our streaming framework (the start of it)
    10. The Streamwork framework overview
    11. Zooming in on the Streamwork engine
    12. Core streaming concepts
    13. More details of the concepts
    14. The streaming job execution flow
    15. Your first streaming job
    16. Executing the job
    17. Inspecting the job execution
    18. Look inside the engine
    19. Keep events moving
    20. The life of a data element
    21. Reviewing streaming concepts
    22. Summary
    23. Exercises
  10. 3 Parallelization and data grouping
    1. The sensor is emitting more events
    2. Even in streaming, real time is hard
    3. New concepts: Parallelism is important
    4. New concepts: Data parallelism
    5. New concepts: Data execution independence
    6. New concepts: Task parallelism
    7. Data parallelism vs. task parallelism
    8. Parallelism and concurrency
    9. Parallelizing the job
    10. Parallelizing components
    11. Parallelizing sources
    12. Viewing job output
    13. Parallelizing operators
    14. Viewing job output
    15. Events and instances
    16. Event ordering
    17. Event grouping
    18. Shuffle grouping
    19. Shuffle grouping: Under the hood
    20. Fields grouping
    21. Fields grouping: Under the hood
    22. Event grouping execution
    23. Look inside the engine: Event dispatcher
    24. Applying fields grouping in your job
    25. Event ordering
    26. Comparing grouping behaviors
    27. Summary
    28. Exercises
  11. 4 Stream graph
    1. A credit card fraud detection system
    2. More about the credit card fraud detection system
    3. The fraud detection business
    4. Streaming isn’t always a straight line
    5. Zoom into the system
    6. The fraud detection job in detail
    7. New concepts
    8. Upstream and downstream components
    9. Stream fan-out and fan-in
    10. Graph, directed graph, and DAG
    11. DAG in stream processing systems
    12. All new concepts in one page
    13. Stream fan-out to the analyzers
    14. Look inside the engine
    15. There is a problem: Efficiency
    16. Stream fan-out with different streams
    17. Look inside the engine again
    18. Communication between the components via channels
    19. Multiple channels
    20. Stream fan-in to the score aggregator
    21. Stream fan-in in the engine
    22. A brief introduction to anotherstream fan-in: Join
    23. Look at the whole system
    24. Graph and streaming jobs
    25. The example systems
    26. Summary
    27. Exercises
  12. 5 Delivery semantics
    1. The latency requirement of the fraud detection system
    2. Revisit the fraud detection job
    3. About accuracy
    4. Partial result
    5. A new streaming job to monitor system usage
    6. The new system usage job
    7. The requirements of the new system usage job
    8. New concepts: (The number of) times delivered and times processed
    9. New concept: Delivery semantics
    10. Choosing the right semantics
    11. At-most-once
    12. The fraud detection job
    13. At-least-once
    14. At-least-once with acknowledging
    15. Track events
    16. Handle event processing failures
    17. Track early out events
    18. Acknowledging code in components
    19. New concept: Checkpointing
    20. New concept: State
    21. Checkpointing in the system usage job for the at-least-once semantic
    22. Checkpointing and state manipulation functions
    23. State handling code in the transaction source component
    24. Exactly-once or effectively-once?
    25. Bonus concept: Idempotent operation
    26. Exactly-once, finally
    27. State handling code in the system usage analyzer component
    28. Comparing the delivery semantics again
    29. Summary
    30. Exercises
    31. Up next ...
  13. 6 Streaming systems review and a glimpse ahead
    1. Streaming system pieces
    2. Parallelization and event grouping
    3. DAGs and streaming jobs
    4. Delivery semantics (guarantees)
    5. Delivery semantics used in the credit card fraud detection system
    6. Which way to go from here
    7. Windowed computations
    8. Joining data in real time
    9. Backpressure
    10. Stateless and stateful computations
  14. Part 2. Stepping up
  15. 7 Windowed computations
    1. Slicing up real-time data
    2. Breaking down the problem in detail
    3. Breaking down the problem in detail (continued)
    4. Two different contexts
    5. Windowing in the fraud detection job
    6. What exactly are windows?
    7. Looking closer into the window
    8. New concept: Windowing strategy
    9. Fixed windows
    10. Fixed windows in the windowed proximity analyzer
    11. Detecting fraud with a fixed time window
    12. Fixed windows: Time vs. count
    13. Sliding windows
    14. Sliding windows: Windowed proximity analyzer
    15. Detecting fraud with a sliding window
    16. Session windows
    17. Session windows (continued)
    18. Detecting fraud with session windows
    19. Summary of windowing strategies
    20. Slicing an event stream into data sets
    21. Windowing: Concept or implementation
    22. Another look
    23. Key–value store 101
    24. Implement the windowed proximity analyzer
    25. Event time and other times for events
    26. Windowing watermark
    27. Late events
    28. Summary
    29. Exercise
  16. 8 Join operations
    1. Joining emission data on the fly
    2. The emissions job version 1
    3. The emission resolver
    4. Accuracy becomes an issue
    5. The enhanced emissions job
    6. Focusing on the join
    7. What is a join again?
    8. How the stream join works
    9. Stream join is a different kind of fan-in
    10. Vehicle events vs. temperature events
    11. Table: A materialized view of streaming
    12. Vehicle events are less efficient to be materialized
    13. Data integrity quickly became an issue
    14. What’s the problem with this join operator?
    15. Inner join
    16. Outer join
    17. The inner join vs. outer join
    18. Different types of joins
    19. Outer joins in streaming systems
    20. A new issue: Weak connection
    21. Windowed joins
    22. Joining two tables instead of joining a stream and table
    23. Revisiting the materialized view
    24. Summary
  17. 9 Backpressure
    1. Reliability is critical
    2. Review the system
    3. Streamlining streaming jobs
    4. New concepts: Capacity, utilization, and headroom
    5. More about utilization and headroom
    6. New concept: Backpressure
    7. Measure capacity utilization
    8. Backpressure in the Streamwork engine
    9. Backpressure in the Streamwork engine: Propagation
    10. Our streaming job during a backpressure
    11. Backpressure in distributed systems
    12. New concept: Backpressure watermarks
    13. Another approach to handle lagging instances: Dropping events
    14. Why do we want to drop events?
    15. Backpressure could be a symptom when the underlying issue is permanent
    16. Stopping and resuming may lead to thrashing if the issue is permanent
    17. Handle thrashing
    18. Summary
  18. 10 Stateful computation
    1. The migration of the streaming jobs
    2. Stateful components in the system usage job
    3. Revisit: State
    4. The states in different components
    5. State data vs. temporary data
    6. Stateful vs. stateless components: The code
    7. The stateful source and operator in the system usage job
    8. States and checkpoints
    9. Checkpoint creation: Timing is hard
    10. Event-based timing
    11. Creating checkpoints with checkpoint events
    12. A checkpoint event is handled by instance executors
    13. A checkpoint event flowing through a job
    14. Creating checkpoints with checkpoint events at the instance level
    15. Checkpoint event synchronization
    16. Checkpoint loading and backward compatibility
    17. Checkpoint storage
    18. Stateful vs. stateless components
    19. Manually managed instance states
    20. Lambda architecture
    21. Summary
    22. Exercises
  19. 11 Wrap-up: Advanced concepts in streaming systems
    1. Is this really the end?
    2. Windowed computations
    3. The major window types
    4. Joining data in real time
    5. SQL vs. stream joins
    6. Inner joins vs. outer joins
    7. Unexpected things can happen in streaming systems
    8. Backpressure: Slow down sources or upstream components
    9. Another approach to handle lagging instances: Dropping events
    10. Backpressure can be a symptom when the underlying issue is permanent
    11. Stateful components with checkpoints
    12. Event-based timing
    13. Stateful vs. stateless components
    14. You did it!
  20. Appendix. Key concepts covered in this book
  21. index

Product information

  • Title: Grokking Streaming Systems
  • Author(s): Ning Wang, Josh Fischer
  • Release date: March 2022
  • Publisher(s): Manning Publications
  • ISBN: 9781617297304