Every data set poses challenges and potential pitfalls when it comes to data quality and data accuracy. The challenges faced in Makeover Monday can be grouped into five main categories:
- Dealing with missing or incomplete data
- Overcounting data
- Sense-checking data
- Is the data aggregable?
- Substantiating claims with data
This chapter examines each of these categories in detail, using specific examples that demonstrate common mistakes and how to avoid and correct them.
Working with Incomplete Data
Data sets, especially publicly available ones, need to be examined with a critical eye. Values could be missing, the range of the data set could be incomplete, or records could be duplicated. How should you handle these situations? How do you identify these problems in the first place? Once you have identified the problems, is it safe to use incomplete data for comparisons?
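One way to start answering the "how do you identify these problems" question is a quick coverage check before building any chart. The sketch below is only an illustration: it assumes the data arrives as (year, quarter, value) rows, and the function name `check_quarterly_coverage` and the sample values are hypothetical placeholders, not figures from any Makeover Monday data set.

```python
from collections import Counter


def check_quarterly_coverage(rows):
    """Report duplicated and missing quarters in a list of (year, quarter, value) rows.

    Assumes quarterly data with quarters numbered 1 to 4 and a non-empty list.
    Returns (duplicates, missing), each a sorted list of (year, quarter) keys.
    """
    keys = [(year, quarter) for year, quarter, _ in rows]
    counts = Counter(keys)

    # A period that appears more than once may be double counted in a sum.
    duplicates = sorted(key for key, n in counts.items() if n > 1)

    # Every quarter between the first and last period present should exist.
    first, last = min(keys), max(keys)
    expected = [
        (y, q)
        for y in range(first[0], last[0] + 1)
        for q in range(1, 5)
        if first <= (y, q) <= last
    ]
    missing = [key for key in expected if key not in counts]
    return duplicates, missing


# Hypothetical placeholder rows, invented purely to exercise the check.
sample = [
    (2020, 1, 10),
    (2020, 2, 12),
    (2020, 2, 12),  # repeated row
    (2020, 4, 15),  # quarter 3 is absent
    (2021, 1, 11),
]

duplicates, missing = check_quarterly_coverage(sample)
print("duplicated periods:", duplicates)
print("missing periods:", missing)
```

A check like this only finds gaps inside the range the data covers. It cannot tell you whether the final year simply has not finished yet, so the first and last periods still need to be read against the source's description of the data.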
When we provided a data set on iPhone sales, many people charted units sold over time, similar to Figure 2.1.
Displaying the data by year shows sales increasing every year from 2008 to 2015 and then decreasing in 2016. A natural next step is to look at the year-over-year change, as in Figure 2.2.
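The yearly roll-up and year-over-year change described here can be sketched in a few lines. This is a minimal illustration under stated assumptions: the input is assumed to be quarterly (year, quarter, units) rows, the function name `yearly_change` is hypothetical, and the sample numbers are invented placeholders rather than actual iPhone sales. The sketch also records how many quarters each year contains, since the earlier question about comparing incomplete data applies directly to annual totals.

```python
def yearly_change(rows, periods_per_year=4):
    """Aggregate (year, quarter, units) rows by year and compute year-over-year change.

    Returns a list of dicts, one per year in ascending order, with the yearly
    total, the fractional change versus the previous calendar year (None when
    that year is absent or its total is zero), and a flag showing whether the
    year contains all expected periods.
    """
    totals = {}
    quarters_seen = {}
    for year, quarter, units in rows:
        totals[year] = totals.get(year, 0) + units
        quarters_seen.setdefault(year, set()).add(quarter)

    result = []
    for year in sorted(totals):
        previous_total = totals.get(year - 1)
        if previous_total:
            change = (totals[year] - previous_total) / previous_total
        else:
            change = None
        result.append({
            "year": year,
            "units": totals[year],
            "yoy_change": change,
            "complete": len(quarters_seen[year]) == periods_per_year,
        })
    return result


# Hypothetical placeholder rows: two full years and one year with two quarters.
sample_rows = (
    [(2020, q, 10) for q in range(1, 5)]
    + [(2021, q, 15) for q in range(1, 5)]
    + [(2022, q, 15) for q in range(1, 3)]
)

for row in yearly_change(sample_rows):
    print(row)
```

In the placeholder data the last year shows a drop in the yearly total even though each of its quarters matches the previous year's, simply because only two quarters are present. Carrying a completeness flag alongside the change makes that kind of apparent decline easy to spot before it ends up in a chart.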
Many people saw ...