Chapter 4. How Do You Analyze Infinite Data Sets?
Infinite data sets raise important questions about how to do certain operations when you don’t have all the data and never will. In particular, what do classic SQL operations like GROUP BY and JOIN mean in this context?
A theory of streaming semantics is emerging to answer questions like these. Central to this theory is the idea that operations like GROUP BY and JOIN are now based on snapshots of the data available at points in time.
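To make the snapshot idea concrete, here is a minimal sketch in plain Python of a GROUP BY computed over fixed time windows. It does not use Beam or any other streaming engine's API. The function name `windowed_counts`, the 60-second window size, and the sample events are all hypothetical, chosen only for illustration. It also assumes the events are already collected in a list, whereas a real stream never ends and may deliver events late or out of order.

```python
from collections import defaultdict


def windowed_counts(events, window_seconds):
    """Count events per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    Returns {window_start: {key: count}}, ordered by window start.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # An event belongs to the window starting at the largest multiple
        # of window_seconds that does not exceed its timestamp.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start][key] += 1
    return {start: dict(per_key) for start, per_key in sorted(counts.items())}


# Hypothetical events: (event time in seconds, page visited).
events = [
    (1, "home"), (15, "cart"), (42, "home"),
    (61, "home"), (75, "cart"), (118, "cart"),
]

result = windowed_counts(events, window_seconds=60)
print(result)
# {0: {'home': 2, 'cart': 1}, 60: {'home': 1, 'cart': 2}}
```

Each window's result is a snapshot. It answers "how many visits per page?" for one bounded slice of time rather than for the whole, never-ending data set.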
Apache Beam, formerly known as Google Dataflow, is perhaps the best-known mature streaming engine that offers a sophisticated formulation of these semantics. It has become the de facto standard for how precise analytics can be performed in real-world streaming scenarios. A third-party “runner” is required to execute Beam dataflows. In the open source world, teams are implementing this functionality for Flink, Gearpump, and Spark Streaming, while Google’s own runner is its cloud service, Cloud Dataflow. This means you will soon be able to write Beam dataflows and run them on these engines, or use the native Flink, Gearpump, or Spark Streaming APIs to write dataflows with the same behaviors.
For space reasons, I can only provide a sketch of these semantics here, but two O’Reilly Radar blog posts by Tyler Akidau, a leader of the Beam/Dataflow team, cover them in depth.¹ If you follow no other links in this report, at least read those blog posts!
Suppose we set out to build our own ...