Chapter 18. Handling Bad Data in Event Streams
At a high level, bad data is data that doesn’t conform to what is expected; for example, an email address without the @ or a credit card expiry where the MM/YYYY format is swapped to YYYY/MM. Bad data can also include malformed and corrupted data that is completely indecipherable and effectively garbage. This chapter covers how bad data comes to be and how you can deal with it in event streams.
Event streams are predicated on an immutable log, where data, once written, cannot be edited or deleted (outside of expiry or compaction—more on this later in the chapter). Despite all the benefits of the immutable log, the downside is that it makes it trickier to deal with bad data. You can’t simply reach in and edit it once it’s produced to the stream, like you could do with data in a mutable data store.
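To make the constraint concrete, here is a minimal in-memory sketch of an append-only log. It is not a real broker client: the names `Event`, `AppendOnlyLog`, and `compact` are hypothetical, and `compact` only approximates the idea of keeping the latest record per key. The point it illustrates is that a bad record stays in the log, and the only way to "fix" it is to append a newer record.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """A single immutable record (hypothetical shape, for illustration only)."""
    key: str
    value: str


class AppendOnlyLog:
    """Toy stand-in for an event stream: records can be appended and read, never edited."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the newly written record

    def read(self) -> Tuple[Event, ...]:
        # Return an immutable snapshot so callers cannot mutate the log.
        return tuple(self._events)


def compact(events: Tuple[Event, ...]) -> Dict[str, str]:
    """Keep only the most recent value for each key (a simplified view of compaction)."""
    latest: Dict[str, str] = {}
    for event in events:
        latest[event.key] = event.value
    return latest


log = AppendOnlyLog()
log.append(Event("user-1", "alice.example.com"))  # bad data: no @ in the email
log.append(Event("user-2", "bob@example.com"))
log.append(Event("user-1", "alice@example.com"))  # a newer record for the same key

compacted = compact(log.read())
```

The bad record is still present at offset 0 for any consumer that reads the full log; only a reader that keeps the latest value per key sees the repaired state.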
There is no single way to handle bad data in event streams. Instead, you’ll need to rely on a set of strategies to prevent, mitigate, and fix it. The most successful strategies are, in order:
- Prevention: Prevent bad data from entering the stream in the first place: use schemas, testing, and validation rules. Fail fast and gracefully when data is incorrect.
- Event design: Use event designs that let you issue corrections, overwriting previous bad data.
- Rewind, rebuild, and retry: For when all else fails.
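As a rough illustration of the prevention strategy, the following sketch validates an event before it is written and fails fast when a rule is violated. Everything here is an assumption made for the example: the `PaymentEvent` shape, the `validate` and `produce` helpers, and the use of a plain list in place of a real stream. The email rule is deliberately simplistic; a schema and a proper validation library would normally do this work.

```python
import re
from dataclasses import dataclass
from typing import List


class ValidationError(ValueError):
    """Raised when an event breaks a validation rule (hypothetical error type)."""


# MM/YYYY: a month from 01 to 12, a slash, then a four-digit year.
EXPIRY_RE = re.compile(r"(0[1-9]|1[0-2])/\d{4}")


@dataclass(frozen=True)
class PaymentEvent:
    """Hypothetical event shape used only for this example."""
    email: str
    card_expiry: str  # expected format: MM/YYYY


def validate(event: PaymentEvent) -> PaymentEvent:
    """Return the event unchanged if it passes every rule, otherwise raise."""
    local, sep, domain = event.email.partition("@")
    if not (sep and local and domain):
        raise ValidationError(f"email is missing a valid @: {event.email!r}")
    if not EXPIRY_RE.fullmatch(event.card_expiry):
        raise ValidationError(f"card_expiry is not MM/YYYY: {event.card_expiry!r}")
    return event


def produce(event: PaymentEvent, stream: List[PaymentEvent]) -> None:
    """Validate first, so a bad event never reaches the (stand-in) stream."""
    stream.append(validate(event))


stream: List[PaymentEvent] = []
rejected: List[str] = []

candidates = [
    PaymentEvent("alice@example.com", "04/2027"),  # valid
    PaymentEvent("alice.example.com", "04/2027"),  # email without @
    PaymentEvent("alice@example.com", "2027/04"),  # expiry swapped to YYYY/MM
]

for candidate in candidates:
    try:
        produce(candidate, stream)
    except ValidationError as err:
        # Fail gracefully: record the reason instead of crashing the producer.
        rejected.append(str(err))
```

Rejecting at the producer keeps the two malformed events out of the stream entirely, which is cheaper than correcting them once they are part of the immutable log.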
To properly discuss these three options, ...