A Guide to Improving Data Integrity and Adoption

In most companies, quality data is crucial to measuring success and planning for business goals. Unlike sample datasets in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know if the data is accurate or whether we can trust final conclusions? What steps can we take to not only ensure that all of the data is transformed correctly, but also to verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? What can we do to expand adoption of data?

Validating Data Integrity as an Integral Part of Business

Data can be messy for many reasons. Unstructured data such as log files can be difficult to understand and to parse for information. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might require logic to accommodate slang such as “bad ass,” a positive phrase built from negative words. The system creating the data can also make it messy, because different languages and frameworks carry different design expectations; Ruby on Rails, for example, requires a separate table to represent many-to-many relationships.
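To make the slang problem concrete, here is a minimal sketch in Python. It assumes a toy word-level sentiment scorer; the word scores, the slang table, and the function names are all hypothetical and chosen only for illustration, not taken from any real sentiment library.

```python
import re

# Hypothetical scores, chosen only for illustration.
WORD_SCORES = {"bad": -1, "ass": -1, "great": 1, "terrible": -1}
SLANG_SCORES = {"bad ass": 2}  # a positive phrase made of negative words


def naive_score(text: str) -> int:
    """Sum per-word scores, ignoring multi-word phrases."""
    return sum(WORD_SCORES.get(word, 0) for word in text.lower().split())


def slang_aware_score(text: str) -> int:
    """Score known slang phrases first, then score the remaining words."""
    remaining = text.lower()
    total = 0
    for phrase, score in SLANG_SCORES.items():
        # Word boundaries keep "bad ass" from matching inside "bad assumption".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        remaining, count = re.subn(pattern, " ", remaining)
        total += count * score
    return total + naive_score(remaining)


post = "that guitar solo was bad ass"
print(naive_score(post))        # -2
print(slang_aware_score(post))  # 2
```

The word-level scorer reads the post as negative, while the phrase-aware version reads it as positive. Real forum text would need a much larger and regularly maintained phrase table, which is part of what makes this kind of data expensive to process.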

Implementation or design can also lead to messy data. For example, the process or code that creates data and the database that stores it might use incompatible formats. Or, the code might ...
