Chapter 83. Time (Semantics) Won’t Wait

Marta Paes Moreira and Fabian Hueske

Data pipelines are evolving from storing continuously arriving data and processing it as bounded batches, to streaming approaches that continuously ingest and process unbounded streams of data. Usually, the goal is to reduce the latency between when data is received and when it is processed.

An important difference between batch and stream processing is the notion of completeness. In batch processing, data is always considered complete (as defined by the input), but stream-processing applications need to reason about the completeness of their input when ingesting unbounded streams of data. For example, a common task is to compute aggregates for regular time intervals, like counting the number of click events per hour. When implementing such a stream-processing application, you need to decide when to start and stop counting (i.e., which count to increment for an event).

The most straightforward approach is to count events based on the system time of the machine that runs the computation. This approach is commonly called processing time; a sketch of what it looks like in code follows the list below. While it is easy to implement, it has several undesirable properties, including these:

  • The results are nondeterministic and depend on several external factors, ...
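For illustration, here is a minimal sketch of such a processing-time count using Apache Flink's DataStream API (one widely used stream processor; the chapter itself does not prescribe a framework). The socket source, port, and single-field click records are assumptions made for the example:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class ProcessingTimeClickCount {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Assumption for the example: each incoming line is a user ID
            // representing one click event.
            DataStream<Tuple2<String, Long>> clicks = env
                .socketTextStream("localhost", 9999)
                .map(userId -> Tuple2.of(userId, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

            clicks
                .keyBy(t -> t.f0)
                // Windows open and close by the wall clock of the processing
                // machine, not by when the clicks actually happened.
                .window(TumblingProcessingTimeWindows.of(Time.hours(1)))
                .sum(1)  // increment the count of whichever window is open now
                .print();

            env.execute("Hourly click count (processing time)");
        }
    }

Because window assignment depends only on the machine's clock, rerunning this job over the same recorded clicks generally yields different counts; that is exactly the nondeterminism the list above describes.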
