Chapter 4. Idempotency Design Patterns
Every data engineering activity eventually runs into errors, as you already know from the previous chapter. Thankfully, correctly implemented error management design patterns address most of the issues. Yes, you read that correctly: most, not all. But why?
Let’s take a look at an example: automatic recovery from a temporary failure. From an engineering standpoint, that’s a great feature because you have nothing to do besides configuring the number of retry attempts. However, from the data perspective, this great feature poses a serious challenge to consistency. A retried task or job might replay write operations that already succeeded in the target data store, leading to duplication in the best-case scenario. You read that right: duplication is the best-case scenario because duplicates can be removed on the consumer’s side. But let’s imagine the opposite. The retried item generates duplicates that cannot be removed because you can’t even tell they represent the same data! Welcome to your nightmare and bad publicity for your dataset.
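The replay scenario described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_with_retries`, `load_orders`, and the in-memory `target_store` list are hypothetical names standing in for an orchestrator's retry mechanism, a loading task, and an append-only target table.

```python
# Hypothetical in-memory stand-in for an append-only target data store.
target_store = []


def run_with_retries(task, max_attempts):
    """Run the task until it succeeds or the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            task(attempt)
            return attempt  # number of attempts that were needed
        except ConnectionError:
            if attempt == max_attempts:
                raise


def load_orders(attempt):
    """Append two rows; fail on the first attempt after writing the first row."""
    rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]
    for index, row in enumerate(rows):
        target_store.append(row)
        # Simulated temporary failure: the first row is already written.
        if attempt == 1 and index == 0:
            raise ConnectionError("temporary network issue")


attempts_used = run_with_retries(load_orders, max_attempts=3)
order_ids = [row["order_id"] for row in target_store]
print(attempts_used, order_ids)
```

The retry succeeds from the engineering point of view, yet the first row was written by both the failed and the successful attempt, so the target now holds it twice. Here the duplicate is still detectable thanks to `order_id`; without such a stable key, the consumer could not tell the two rows apart.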
Fortunately, you can mitigate these issues with the idempotency design patterns presented in this chapter. But before you see how they apply to data engineering, let’s recall the definition of idempotency. The best example to explain it is the absolute value function. You know, it’s the simple function that returns a non-negative number even if the input argument is a negative number. Why is it idempotent? Because no matter ...
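As a quick check of the absolute value example, the following minimal sketch uses Python's built-in `abs` and compares applying it once with applying it twice. The `increment` function is a hypothetical counterexample added only for contrast; it is not part of the original text.

```python
values = [-3, 0, 2.5]

# Applying abs once or twice gives the same result for every input.
once = [abs(v) for v in values]
twice = [abs(abs(v)) for v in values]
abs_is_idempotent = once == twice


def increment(x):
    """Hypothetical non-idempotent function: each call changes the result."""
    return x + 1


# Applying increment twice does not match applying it once.
increment_is_idempotent = all(increment(increment(v)) == increment(v) for v in values)

print(abs_is_idempotent, increment_is_idempotent)
```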