Chapter 2. Questions and Data Scope

As data scientists, we use data to answer questions, and the quality of the data collection process can significantly impact the validity and accuracy of the data, the strength of the conclusions we draw from an analysis, and the decisions we make. In this chapter, we describe a general approach for understanding data collection and evaluating the usefulness of the data in addressing the question of interest. Ideally, we aim for data to be representative of the phenomenon that we are studying, whether that phenomenon is a population characteristic, a physical model, or some type of social behavior. Typically, our data do not contain complete information (the scope is restricted in some way), yet we want to use the data to accurately describe a population, estimate a scientific quantity, infer the form of a relationship between features, or predict future outcomes. In all of these situations, if our data are not representative of the object of our study, then our conclusions can be limited, possibly misleading, or even wrong.

To motivate the need to think about these issues, we begin with an example of the power of big data and what can go wrong. We then provide a framework that can help you connect the goal of your study (your question) with the data collection process. We refer to this as the data scope,1 and we provide terminology to help describe data scope, along with examples from surveys, government data, scientific instruments, and online ...

Get Learning Data Science now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.