Chapter 3. Understanding Data Quality

Data is everywhere. It’s automatically generated by our mobile devices, our shopping activities, and our physical movements. It’s captured by our electric meters, public transportation systems, and communications infrastructure. And it’s used to estimate our health outcomes, our earning potential, and our creditworthiness.1 Economists have even declared that data is the “new oil,”2 given its potential to transform so many aspects of human life.

Data may be plentiful, but good data is scarce. The claim of “the data revolution” is that, with enough data, we can better understand the present and improve, or even predict, the future. For any of that to be possible, however, the data underlying those insights has to be high quality. Without good-quality data, all of our efforts to wrangle, analyze, visualize, and communicate it will, at best, leave us with no more insight about the world than when we started. While that would be an unfortunate waste of effort, the consequences of failing to recognize that we have poor-quality data are even worse, because that failure can lead us to develop a seemingly rational but dangerously distorted view of reality. What’s more, because data-driven systems are used to make decisions at scale, the harms caused by even a small amount of bad data can be significant. Sure, data about hundreds or even thousands of people may be used to “train” a machine learning model. But if that data is not ...
