Chapter 2 Data Quality and Accuracy

Every data set poses challenges and potential pitfalls when it comes to data quality and data accuracy. The challenges faced in Makeover Monday can be grouped into five main categories:

  1. Dealing with missing or incomplete data
  2. Overcounting data
  3. Sense-checking data
  4. Checking whether the data can be aggregated
  5. Substantiating claims with data

This chapter examines each of these categories in detail, using specific examples to show the mistakes people commonly make and how to avoid or correct them.

Working with Incomplete Data

Data sets, especially publicly available ones, need to be viewed through a critical lens. Values could be missing, the range of the data set may be incomplete, or records might be duplicated. How do you identify these problems in the first place? How should you handle them? And once you have identified them, is it safe to use incomplete data for comparisons?
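The three problems named above (missing values, an incomplete range, and duplicated records) can each be checked mechanically before any chart is built. The following is a minimal sketch in Python, not a method taken from this chapter; the function name `profile_rows`, the field names, and the sample rows are all hypothetical and invented purely for illustration.

```python
from collections import Counter


def profile_rows(rows, key_fields, value_field, expected_keys):
    """Report duplicated keys, keys with a missing value, and absent keys.

    rows          -- list of dicts, one per record
    key_fields    -- fields that should uniquely identify a record
    value_field   -- the measure that must not be empty (None)
    expected_keys -- every key the data set is supposed to contain
    """
    keys = [tuple(row[field] for field in key_fields) for row in rows]
    counts = Counter(keys)
    return {
        # The same key appearing more than once suggests duplicated records.
        "duplicated": sorted(k for k, n in counts.items() if n > 1),
        # A record exists, but its measure is empty.
        "missing_value": sorted(
            {k for k, row in zip(keys, rows) if row[value_field] is None}
        ),
        # No record at all for a key the range should cover.
        "absent": sorted(set(expected_keys) - set(keys)),
    }


# Made-up rows for illustration only; these are not real sales figures.
rows = [
    {"year": 2014, "quarter": 1, "units": 10},
    {"year": 2014, "quarter": 2, "units": None},
    {"year": 2014, "quarter": 2, "units": None},
    {"year": 2014, "quarter": 4, "units": 12},
]
expected = [(2014, quarter) for quarter in range(1, 5)]

report = profile_rows(rows, ["year", "quarter"], "units", expected)
print(report)
```

A report like this only locates the problems. Whether to exclude, annotate, or estimate the affected records remains a judgment call that depends on the comparison you want to make.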

Incomplete Data

When we provided a data set on iPhone sales over time, many people charted the number of units sold by year, similar to Figure 2.1.

[Figure: bar chart of global iPhone sales, with years 2007 to 2016 on one axis and units sold (0M to 240M) on the other. The bars increase in height up to 2015, the highest at between 220M and 240M units; the lowest is 2007, at below 20M units. The 2016 bar is shaded differently from the rest.]

Figure 2.1 iPhone units sold by year.

Displaying the data by year shows an annual increase from 2008 to 2015 and then a decrease in 2016. A natural next step is to look at the year-over-year change, as in Figure 2.2.

Figure 2.2 Year-over-year change of iPhone units sold.
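A year-over-year change is the difference between one year's total and the previous year's total, divided by the previous year's total. The comparison is only meaningful when both years cover the same number of reporting periods. The sketch below is one possible way to guard against that, assuming the data arrives as quarterly records; the function names, the `periods_per_year` parameter, and all of the figures are hypothetical and do not come from the iPhone data set.

```python
def yearly_totals(records, periods_per_year=4):
    """Sum units per year and note whether each year has every period.

    records -- iterable of (year, period, units) tuples
    Returns {year: (total_units, is_complete)} in ascending year order.
    """
    totals = {}
    periods = {}
    for year, period, units in records:
        totals[year] = totals.get(year, 0) + units
        periods.setdefault(year, set()).add(period)
    return {
        year: (totals[year], len(periods[year]) == periods_per_year)
        for year in sorted(totals)
    }


def year_over_year(totals):
    """Return {year: fractional change}, or None where it is not comparable."""
    changes = {}
    years = list(totals)
    for prev, cur in zip(years, years[1:]):
        prev_total, prev_complete = totals[prev]
        cur_total, cur_complete = totals[cur]
        comparable = (
            cur - prev == 1          # no gap between the two years
            and prev_complete        # both years cover every period
            and cur_complete
            and prev_total != 0      # avoid dividing by zero
        )
        changes[cur] = (cur_total - prev_total) / prev_total if comparable else None
    return changes


# Made-up quarterly figures for illustration only.
records = (
    [(2001, q, 10) for q in range(1, 5)]    # full year, total 40
    + [(2002, q, 15) for q in range(1, 5)]  # full year, total 60
    + [(2003, q, 20) for q in range(1, 3)]  # only two quarters, total 40
)

totals = yearly_totals(records)
changes = year_over_year(totals)
print(changes)
```

In this made-up example, the raw totals would suggest a drop from 60 to 40 in the final year, but the function returns no change for it because only two of its four quarters are present. Returning `None` rather than a number is a deliberate choice here: it forces whoever builds the chart to decide how to present the partial year.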

Many people saw ...
