Chapter 7. Be Intentional About the Batching Model in Your Data Pipelines
Raghotham Murthy
If you are ingesting data records in batches and building batch data pipelines, you need to choose how batches are created over time. Batches can be based on either the data_timestamp or the arrival_timestamp of the record. The data_timestamp is the last-updated timestamp included in the record itself. The arrival_timestamp is the timestamp attached to the record when it is received by the processing system.
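To make the distinction concrete, here is a minimal sketch that groups the same records into hourly windows by each timestamp. The field names data_timestamp and arrival_timestamp come from the text; the sample records, the one-hour window size, and the helper names hourly_window and batch_by are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: the values are invented for this example.
records = [
    {"id": 1,
     "data_timestamp": datetime(2024, 1, 1, 9, 55),
     "arrival_timestamp": datetime(2024, 1, 1, 10, 2)},
    {"id": 2,
     "data_timestamp": datetime(2024, 1, 1, 10, 5),
     "arrival_timestamp": datetime(2024, 1, 1, 10, 6)},
    {"id": 3,  # a late record: updated at 09:40, received at 11:15
     "data_timestamp": datetime(2024, 1, 1, 9, 40),
     "arrival_timestamp": datetime(2024, 1, 1, 11, 15)},
]


def hourly_window(ts):
    """Return the start of the one-hour window that contains ts."""
    return ts.replace(minute=0, second=0, microsecond=0)


def batch_by(records, field):
    """Group record ids into hourly batches keyed by the chosen timestamp field."""
    batches = defaultdict(list)
    for record in records:
        batches[hourly_window(record[field])].append(record["id"])
    return dict(batches)


by_data_time = batch_by(records, "data_timestamp")
by_arrival_time = batch_by(records, "arrival_timestamp")
```

With these sample values, record 3 belongs to the 09:00 batch when grouped by data_timestamp but to the 11:00 batch when grouped by arrival_timestamp, which is exactly the choice the batching model has to make.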
Data Time Window Batching Model
In the data time window (DTW) batching model, a batch is created for a time window when all records with a data_timestamp in that window have been received. Use this batching model when:
Data is being pulled from (versus being pushed by) the source.
The extraction logic can filter out records with a data_timestamp outside the time window.
For example, use DTW batching when extracting all transactions within a time window from a database. DTW batching makes the analyst’s life easier, since it can guarantee that all records for a given time window are present in that batch, so the analyst knows exactly what data they are working with. However, DTW batching is not very predictable: out-of-order records can delay the point at which a batch is complete.
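The database example above can be sketched as a pull-based extraction that filters on data_timestamp. This is only an illustration under stated assumptions: the transactions table, its columns, the in-memory SQLite database, the half-open window [start, end), and the helper name extract_window are all hypothetical, and timestamps are stored as ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3


def extract_window(conn, window_start, window_end):
    """Pull every transaction whose data_timestamp is in [window_start, window_end)."""
    cursor = conn.execute(
        "SELECT id, amount, data_timestamp FROM transactions "
        "WHERE data_timestamp >= ? AND data_timestamp < ? "
        "ORDER BY id",
        (window_start, window_end),
    )
    return cursor.fetchall()


# Hypothetical source database with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, amount REAL, data_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 20.0, "2024-01-01T09:15:00"),
        (2, 35.5, "2024-01-01T09:59:59"),
        (3, 12.0, "2024-01-01T10:00:00"),  # belongs to the next window
    ],
)

# The DTW batch for the 09:00 to 10:00 window.
batch = extract_window(conn, "2024-01-01T09:00:00", "2024-01-01T10:00:00")
```

The half-open window is a design choice in this sketch: a record stamped exactly on a boundary lands in one batch only, so consecutive windows neither overlap nor leave gaps.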
Arrival Time Window Batching Model
In the arrival time window (ATW) batching ...