Streaming Systems

by Tyler Akidau, Slava Chernyak, Reuven Lax

July 2018

Beginner to intermediate

349 pages

10h 8m

English

O'Reilly Media, Inc.

Preface Or: What Are You Getting Yourself Into Here?
Navigating This Book; Takeaways; Conventions Used in This Book; Online Resources; Figures; Code Snippets; O’Reilly Safari; How to Contact Us; Acknowledgments
I. The Beam Model
1. Streaming 101
Terminology: What Is Streaming?; On the Greatly Exaggerated Limitations of Streaming; Event Time Versus Processing Time; Data Processing Patterns; Bounded Data; Unbounded Data: Batch; Unbounded Data: Streaming; Summary
2. The What, Where, When, and How of Data Processing
Roadmap; Batch Foundations: What and Where; What: Transformations; Where: Windowing; Going Streaming: When and How; When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things!; When: Watermarks; When: Early/On-Time/Late Triggers FTW!; When: Allowed Lateness (i.e., Garbage Collection); How: Accumulation; Summary
3. Watermarks
Definition; Source Watermark Creation; Perfect Watermark Creation; Heuristic Watermark Creation; Watermark Propagation; Understanding Watermark Propagation; Watermark Propagation and Output Timestamps; The Tricky Case of Overlapping Windows; Percentile Watermarks; Processing-Time Watermarks; Case Studies; Case Study: Watermarks in Google Cloud Dataflow; Case Study: Watermarks in Apache Flink; Case Study: Source Watermarks for Google Cloud Pub/Sub; Summary
4. Advanced Windowing
When/Where: Processing-Time Windows; Event-Time Windowing; Processing-Time Windowing via Triggers; Processing-Time Windowing via Ingress Time; Where: Session Windows; Where: Custom Windowing; Variations on Fixed Windows; Variations on Session Windows; One Size Does Not Fit All; Summary
5. Exactly-Once and Side Effects
Why Exactly Once Matters; Accuracy Versus Completeness; Side Effects; Problem Definition; Ensuring Exactly Once in Shuffle; Addressing Determinism; Performance; Graph Optimization; Bloom Filters; Garbage Collection; Exactly Once in Sources; Exactly Once in Sinks; Use Cases; Example Source: Cloud Pub/Sub; Example Sink: Files; Example Sink: Google BigQuery; Other Systems; Apache Spark Streaming; Apache Flink; Summary
II. Streams and Tables
6. Streams and Tables
Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity; Toward a General Theory of Stream and Table Relativity; Batch Processing Versus Streams and Tables; A Streams and Tables Analysis of MapReduce; Reconciling with Batch Processing; What, Where, When, and How in a Streams and Tables World; What: Transformations; Where: Windowing; When: Triggers; How: Accumulation; A Holistic View of Streams and Tables in the Beam Model; A General Theory of Stream and Table Relativity; Summary
7. The Practicalities of Persistent State
Motivation; The Inevitability of Failure; Correctness and Efficiency; Implicit State; Raw Grouping; Incremental Combining; Generalized State; Case Study: Conversion Attribution; Conversion Attribution with Apache Beam; Summary
8. Streaming SQL
What Is Streaming SQL?; Relational Algebra; Time-Varying Relations; Streams and Tables; Looking Backward: Stream and Table Biases; The Beam Model: A Stream-Biased Approach; The SQL Model: A Table-Biased Approach; Looking Forward: Toward Robust Streaming SQL; Stream and Table Selection; Temporal Operators; Summary
9. Streaming Joins
All Your Joins Are Belong to Streaming; Unwindowed Joins; FULL OUTER; LEFT OUTER; RIGHT OUTER; INNER; ANTI; SEMI; Windowed Joins; Fixed Windows; Temporal Validity; Summary
10. The Evolution of Large-Scale Data Processing
MapReduce; Hadoop; Flume; Storm; Spark; MillWheel; Kafka; Cloud Dataflow; Flink; Beam; Summary
Index
About the Authors

Content preview from Streaming Systems

Chapter 10. The Evolution of Large-Scale Data Processing

You have now arrived at the final chapter in the book, you stoic literate, you. Your journey will soon be complete!

To wrap things up, I’d like you to join me on a brief stroll through history, starting back in the ancient days of large-scale data processing with MapReduce and touching upon some of the highlights over the ensuing decade and a half that have brought streaming systems to the point they’re at today. It’s a relatively lightweight chapter in which I make a few observations about important contributions from a number of well-known systems (and a couple maybe not-so-well known), refer you to a bunch of source material you can go read on your own should you want to learn more, all while attempting not to offend or inflame the folks responsible for systems whose truly impactful contributions I’m going to either oversimplify or ignore completely for the sake of space, focus, and a cohesive narrative. Should be a good time.

On that note, keep in mind as you read this chapter that we’re really just talking about specific pieces of the MapReduce/Hadoop family tree of large-scale data processing here. I’m not covering the SQL arena in any way, shape, or form¹; we’re not talking HPC/supercomputers, and so on. So as broad and expansive as the title of this chapter might sound, I’m really focusing on a specific vertical swath of the grand universe of large-scale data processing. Caveat literatus, and all that.

Also note that ...

ISBN: 9781491983867
