Chapter 19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough

Ken Gleason and Q. Ethan McCallum

Most of the decisions we make in our personal and professional lives begin with a query. That query might be for a presentation, a research project, a business forecast or simply finding the optimal combination of shipping time and price on tube socks. There are times when we are intuitively comfortable with our data source, and/or are not overly concerned about the breadth or depth of the answers we get, for instance, when we are looking at movie reviews. Other times, you might care a little more, for example, if you are estimating your requirements for food and water for the Badwater Ultramarathon.[76] Or even for mundane things like figuring out how much of a product to make, or where your production bottlenecks are on the assembly line.

But how do we know when to care and when not to care, and about what? Should you throw away the survey data because a couple of people failed to answer certain questions? Should you blindly accept that your daily sales of widgets in Des Moines seem to quintuple on alternate Fridays? Maybe, maybe not. Much of what you (think you) know about the quality of a given set of data relies on past experience that evolves to intuition. But there are three problems with relying solely on intuition. First, intuition is good at trapping obvious outliers (errors that stick out visibly) but likely won’t do much to track more subtle issues. ...

Get Bad Data Handbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.