Chapter 17. Data Traceability

Reid Draper

Your software consistently provides impressive music recommendations by combining cultural and audio data. Customers are happy. However, things aren’t always perfect. Sometimes that Beyoncé track is attributed to Beyonce. The artist for the Béla Fleck solo album shows up as Béla Fleck and the Flecktones. Worse, the ボリス biography has the artist name listed as ???. Where did things go wrong? Did one of your customers provide you with data in an incorrect character encoding? Did one of the web crawlers have a bug? Perhaps the name resolution code was incorrectly combining a solo artist with his band?
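To make the failure modes concrete, here is a minimal Python sketch of two ways names like these could be damaged. It is only an illustration of plausible causes, not a diagnosis of any real pipeline; the helper names `strip_accents` and `force_ascii` are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical 'cleanup' step: decompose characters, then drop accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def force_ascii(text):
    """Hypothetical lossy step: characters outside ASCII become '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


beyonce = strip_accents("Beyoncé")   # the accent is silently lost
boris = force_ascii("ボリス")         # every character is replaced
print(beyonce, boris)
```

Once either step has run, the original name cannot be recovered from the output alone, which is why knowing which step touched the datum matters.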

How do we solve this problem? We’d like to be able to trace data back to its origin, following each transformation. The record that makes this possible is known as data provenance. In this chapter, we’ll explore ways of keeping track of the source of our data, techniques for backing out bad data, and the business value of adopting this ability.
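As a rough sketch of the idea, a datum can carry its source and the list of transformations applied to it, so that records from a bad source can be found and backed out later. The `Traced` type, the `back_out` helper, and the source names below are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Traced:
    """A value plus the trail that produced it (hypothetical type)."""
    value: str
    source: str                   # where the datum first entered the system
    steps: Tuple[str, ...] = ()   # names of transformations applied, in order

    def apply(self, name: str, fn: Callable[[str], str]) -> "Traced":
        """Return a new Traced with fn applied and the step name recorded."""
        return Traced(fn(self.value), self.source, self.steps + (name,))


def back_out(records: List[Traced], bad_source: str) -> List[Traced]:
    """Drop every record that originated from a source known to be bad."""
    return [r for r in records if r.source != bad_source]


# Source names here are made up for the example.
records = [
    Traced("  Beyoncé ", source="customer_feed_a").apply("strip", str.strip),
    Traced("Beyonce", source="crawler_b").apply("strip", str.strip),
]
clean = back_out(records, "crawler_b")
```

Each record keeps its origin and history alongside the value, so removing everything that came from `crawler_b` is a simple filter rather than a forensic exercise.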


The ability to trace a datum back to its origin is important for several reasons. It helps us to back out or reprocess bad data, and conversely, it allows us to reward and boost good data sources and processing techniques. Furthermore, local privacy laws can mandate things like auditability, data transfer restrictions, and more. For example, California’s Shine the Light Law requires businesses to disclose the personal information that has been shared with third parties, should a resident request it. Europe’s Data Protection ...
