Streaming Systems

by Tyler Akidau, Slava Chernyak, Reuven Lax

July 2018

Beginner to intermediate

349 pages

10h 8m

English

O'Reilly Media, Inc.

Preface Or: What Are You Getting Yourself Into Here?
Navigating This Book; Takeaways; Conventions Used in This Book; Online Resources; Figures; Code Snippets; O’Reilly Safari; How to Contact Us; Acknowledgments
I. The Beam Model
1. Streaming 101
Terminology: What Is Streaming?; On the Greatly Exaggerated Limitations of Streaming; Event Time Versus Processing Time; Data Processing Patterns; Bounded Data; Unbounded Data: Batch; Unbounded Data: Streaming; Summary
2. The What, Where, When, and How of Data Processing
Roadmap; Batch Foundations: What and Where; What: Transformations; Where: Windowing; Going Streaming: When and How; When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things!; When: Watermarks; When: Early/On-Time/Late Triggers FTW!; When: Allowed Lateness (i.e., Garbage Collection); How: Accumulation; Summary
3. Watermarks
Definition; Source Watermark Creation; Perfect Watermark Creation; Heuristic Watermark Creation; Watermark Propagation; Understanding Watermark Propagation; Watermark Propagation and Output Timestamps; The Tricky Case of Overlapping Windows; Percentile Watermarks; Processing-Time Watermarks; Case Studies; Case Study: Watermarks in Google Cloud Dataflow; Case Study: Watermarks in Apache Flink; Case Study: Source Watermarks for Google Cloud Pub/Sub; Summary
4. Advanced Windowing
When/Where: Processing-Time Windows; Event-Time Windowing; Processing-Time Windowing via Triggers; Processing-Time Windowing via Ingress Time; Where: Session Windows; Where: Custom Windowing; Variations on Fixed Windows; Variations on Session Windows; One Size Does Not Fit All; Summary
5. Exactly-Once and Side Effects
Why Exactly Once Matters; Accuracy Versus Completeness; Side Effects; Problem Definition; Ensuring Exactly Once in Shuffle; Addressing Determinism; Performance; Graph Optimization; Bloom Filters; Garbage Collection; Exactly Once in Sources; Exactly Once in Sinks; Use Cases; Example Source: Cloud Pub/Sub; Example Sink: Files; Example Sink: Google BigQuery; Other Systems; Apache Spark Streaming; Apache Flink; Summary
II. Streams and Tables
6. Streams and Tables
Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity; Toward a General Theory of Stream and Table Relativity; Batch Processing Versus Streams and Tables; A Streams and Tables Analysis of MapReduce; Reconciling with Batch Processing; What, Where, When, and How in a Streams and Tables World; What: Transformations; Where: Windowing; When: Triggers; How: Accumulation; A Holistic View of Streams and Tables in the Beam Model; A General Theory of Stream and Table Relativity; Summary
7. The Practicalities of Persistent State
Motivation; The Inevitability of Failure; Correctness and Efficiency; Implicit State; Raw Grouping; Incremental Combining; Generalized State; Case Study: Conversion Attribution; Conversion Attribution with Apache Beam; Summary

8. Streaming SQL
What Is Streaming SQL?; Relational Algebra; Time-Varying Relations; Streams and Tables; Looking Backward: Stream and Table Biases; The Beam Model: A Stream-Biased Approach; The SQL Model: A Table-Biased Approach; Looking Forward: Toward Robust Streaming SQL; Stream and Table Selection; Temporal Operators; Summary
9. Streaming Joins
All Your Joins Are Belong to Streaming; Unwindowed Joins; FULL OUTER; LEFT OUTER; RIGHT OUTER; INNER; ANTI; SEMI; Windowed Joins; Fixed Windows; Temporal Validity; Summary
10. The Evolution of Large-Scale Data Processing
MapReduce; Hadoop; Flume; Storm; Spark; MillWheel; Kafka; Cloud Dataflow; Flink; Beam; Summary
Index
About the Authors

Content preview from Streaming Systems

Chapter 2. The What, Where, When, and How of Data Processing

Okay party people, it’s time to get concrete!

Chapter 1 focused on three main areas: terminology, defining precisely what I mean when I use overloaded terms like “streaming”; batch versus streaming, comparing the theoretical capabilities of the two types of systems and postulating that only two things are necessary to take streaming systems beyond their batch counterparts (correctness and tools for reasoning about time); and data processing patterns, looking at the conceptual approaches taken with both batch and streaming systems when processing bounded and unbounded data.

In this chapter, we look at the data processing patterns from Chapter 1 in more detail, within the context of concrete examples. By the time we’re finished, we’ll have covered what I consider to be the core set of principles and concepts required for robust out-of-order data processing; these are the tools for reasoning about time that truly get you beyond classic batch processing.

To give you a sense of what things look like in action, I use snippets of Apache Beam code, coupled with time-lapse diagrams¹ to provide a visual representation of the concepts. Apache Beam is a unified programming model and portability layer for batch and stream processing, with a set of concrete SDKs in various languages (e.g., Java and Python). Pipelines written with Apache Beam can then be portably run on any of the supported execution ...

Publisher Resources

ISBN: 9781491983867