Chapter 17. Data Traceability
Your software consistently provides impressive music recommendations by combining cultural and audio data. Customers are happy. However, things aren’t always perfect. Sometimes that Beyoncé track is attributed to Beyonce. The artist for the Béla Fleck solo album shows up as Béla Fleck and the Flecktones. Worse, the ããªã¹ biography has the artist name listed as ???. Where did things go wrong? Did one of your customers provide you with data in an incorrect character encoding? Did one of the web crawlers have a bug? Perhaps the name resolution code was incorrectly combining a solo artist with his band?
How do we solve this problem? We’d like to be able to trace data back to its origin, following each transformation along the way. This idea is known as data provenance. In this chapter, we’ll explore ways of keeping track of the source of our data, techniques for backing out bad data, and the business value of adopting this ability.
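To make the idea concrete, here is a minimal Python sketch of a value that carries its own provenance. It is an illustration under stated assumptions, not the approach developed later in the chapter: the `Traced` class, its `apply` method, the `strip_accents` helper, and the source label `customer_feed_42` are all hypothetical names invented for this example.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Traced:
    """A value plus where it came from and what has been done to it (hypothetical)."""
    value: str
    source: str                   # label for the origin, e.g. a feed or crawler
    steps: Tuple[str, ...] = ()   # names of transformations applied, in order

    def apply(self, name: str, fn: Callable[[str], str]) -> "Traced":
        # Return a new record; the original stays untouched so it can be
        # inspected or reprocessed later.
        return Traced(fn(self.value), self.source, self.steps + (name,))


def strip_accents(text: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "Beyonc\u00e9" is the artist name with an accented final e.
raw = Traced("Beyonc\u00e9", source="customer_feed_42")
cleaned = raw.apply("strip_accents", strip_accents)

print(cleaned.value, cleaned.source, cleaned.steps)
```

If the accent later turns out to be missing, the `steps` tuple on `cleaned` points at the transformation that removed it, and `source` identifies which input would need to be reprocessed.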
The ability to trace a datum back to its origin is important for several reasons. It helps us to back out or reprocess bad data, and conversely, it allows us to reward and boost good data sources and processing techniques. Furthermore, local privacy laws can mandate things like auditability, data transfer restrictions, and more. For example, California’s Shine the Light Law requires businesses to disclose the personal information that has been shared with third parties, should a resident request it. Europe’s Data Protection ...