Chapter 2. Data Quality

80% of my time was spent cleaning the data. Better data will always beat better models.

Thomson Nguyen

Data is the foundation of a data-driven organization.

If you don’t have timely, relevant, and trustworthy data, decision-makers have no alternative but to make decisions by gut. Data quality is key.


In this chapter, I’m using “quality” in a very broad sense, considering it primarily from an analyst’s perspective.

Analysts need the right data, collected in the right manner, in the right form, in the right place, and at the right time. (They are not asking for much.) If any of those aspects are missing or lacking, analysts are limited in the questions that they can answer and the type or quality of insights that they can derive from the data.

In this chapter and the next, I will cover this broad topic of data quality. First, I’ll discuss how to ensure that the data collection process is right. This is quality in the sense that the collected data is accurate, timely, coherent, and so on. Then, in the next chapter, I’ll cover how to make sure that we are collecting the right data. This is quality in the sense of choosing and supplying the best data sources to augment existing data and so enable better insights. In short, I’ll cover collecting the data right, followed by collecting the right data.

This chapter focuses on the ways that we know that data is reliable, and all the ways that it can be unreliable. I’ll first cover the facets of data quality—all the attributes ...
