Fast Data Architectures for Streaming Applications

by Dean Wampler
October 2016
Beginner to intermediate
43 pages
50m
English
O'Reilly Media, Inc.
Content preview from Fast Data Architectures for Streaming Applications

Chapter 4. How Do You Analyze Infinite Data Sets?

Infinite data sets raise important questions about how to do certain operations when you don’t have all the data and never will. In particular, what do classic SQL operations like GROUP BY and JOIN mean in this context?

A theory of streaming semantics is emerging to answer questions like these. Central to this theory is the idea that operations like GROUP BY and JOIN are now based on snapshots of the data available at points in time.
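To make the snapshot idea concrete, here is a minimal sketch in plain Python of a GROUP BY count computed separately for each fixed-size time window. The fixed (tumbling) window, the `(timestamp, key)` event shape, and the function names are assumptions chosen for illustration and are not taken from the report. A real streaming engine would also emit results incrementally and deal with late-arriving data, which this sketch ignores.

```python
from collections import defaultdict


def window_start(timestamp, window_size):
    """Return the start time of the fixed window that contains timestamp."""
    return timestamp - (timestamp % window_size)


def grouped_counts_by_window(events, window_size):
    """Count events per key, separately for each fixed time window.

    events: iterable of (timestamp_seconds, key) pairs (hypothetical shape).
    Returns a dict of the form {window_start: {key: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Each event contributes only to the window its timestamp falls in,
        # so every window acts as a bounded snapshot of the unbounded stream.
        counts[window_start(timestamp, window_size)][key] += 1
    # Convert to plain dicts so the result compares and prints cleanly.
    return {start: dict(per_key) for start, per_key in counts.items()}


# A finite sample standing in for a slice of an infinite stream.
events = [(1, "a"), (3, "b"), (4, "a"), (11, "a"), (12, "b"), (19, "b")]
result = grouped_counts_by_window(events, window_size=10)
print(result)
```

Each window's counts can be reported once that window is considered complete, which is what lets an aggregation such as GROUP BY produce answers even though the stream itself never ends. A JOIN can be scoped in the same way, by matching only records that fall into the same window.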

Apache Beam, formerly known as Google Dataflow, is perhaps the best-known mature streaming engine that offers a sophisticated formulation of these semantics. It has become the de facto standard for how precise analytics can be performed in real-world streaming scenarios. A third-party “runner” is required to execute Beam dataflows. In the open source world, teams are implementing this functionality for Flink, Gearpump, and Spark Streaming, while Google’s own runner is its cloud service, Cloud Dataflow. This means you will soon be able to write Beam dataflows and run them with these tools, or you will be able to use the native Flink, Gearpump, or Spark Streaming APIs to write dataflows with the same behaviors.

For space reasons, I can only provide a sketch of these semantics here, but two O’Reilly Radar blog posts by Tyler Akidau, a leader of the Beam/Dataflow team, cover them in depth.¹ If you follow no other links in this report, at least read those blog posts!

Suppose we set out to build our own ...



Publisher Resources

ISBN: 9781492038771