CHAPTER 4: Argue with the Data
“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”
—John Tukey, famous statistician
As you become a Data Head, your job is to demonstrate leadership in asking questions about the data used in a project.
We're talking about the underlying raw data—the raw material—from which all statistics are calculated, machine learning models built, or dashboard visualizations created. This is the data stored in your spreadsheets or databases. If the raw data is bad, no amount of data cleaning wizardry, statistical methodology, or machine learning can hide the stench. Therefore, we can best summarize this chapter with a phrase you may have heard before, “garbage in, garbage out.” In this chapter, we lay out the types of questions you should ask to find out if your data stinks.
We have identified three main prompts to help you argue with the data. Under each prompt, we offer additional follow-up questions.
- Tell me the data origin story.
  - Who collected the data?
  - How was the data collected?
- Is the data representative?
  - Is there sampling bias?
  - What did you do with outliers?
- What data am I not seeing?
  - How did you deal with missing values?
  - Can the data measure what you want it to measure?
In the sections that follow, we'll present each question, why you should ask it, and what issues it often uncovers.
Before we do that, however, let's ...