Chapter 1. The Data Science Lifecycle

Data science is a rapidly evolving field. At the time of this writing, people are still trying to pin down exactly what data science is, what data scientists do, and what skills data scientists should have. What we do know, though, is that data science uses a combination of methods and principles from statistics and computer science to work with and draw insights from data. And learning computer science and statistics in combination makes us better data scientists. We also know that any insights we glean need to be interpreted in the context of the problem that we are working on.

This book covers fundamental principles and skills that data scientists need to help make all sorts of important decisions. With both technical skills and conceptual understanding we can work on data-centric problems to, say, assess whether a vaccine works, filter out fake news automatically, calibrate air quality sensors, and advise analysts on policy changes.

To help you keep track of the bigger picture, we’ve organized topics around a workflow that we call the data science lifecycle. In this chapter, we introduce this lifecycle. Unlike other data science books, which tend to focus on one part of the lifecycle or address only computational or statistical topics, we cover the entire cycle from start to finish and consider both statistical and computational aspects together.

The Stages of the Lifecycle

Figure 1-1 shows the data science lifecycle, which is divided into four stages: Ask a Question, Obtain Data, Understand the Data, and Understand the World. We’ve purposefully made these stages broad. In our experience, the mechanics of the lifecycle change frequently. Computer scientists and statisticians continue to build new software packages and programming languages for working with data, and they develop new methodologies that are more specialized.

Figure 1-1. The four high-level stages of the data science lifecycle with arrows indicating how the stages can lead into one another

Despite these changes, we’ve found that almost every data project consists of these four stages:

Ask a Question

Asking good questions is at the heart of data science, and recognizing different kinds of questions guides us in our analyses. We cover four categories of questions: descriptive, exploratory, inferential, and predictive. For example, “How have house prices changed over time?” is descriptive in nature, whereas “Which aspects of houses are related to sale price?” is exploratory. Narrowing down a broad question into one that can be answered with data is a key element of this first stage in the lifecycle. It can involve consulting the people participating in a study, figuring out how to measure something, and designing data collection protocols. A clear and focused research question helps us determine the data we need, the patterns to look for, and how to interpret results. It can also help us refine our question, recognize the type of question being asked, and plan the data collection phase of the lifecycle.

Obtain Data

When data are expensive and hard to gather and when our goal is to generalize from the data to the world, we aim to define precise protocols for collecting the data. Other times, data are cheap and easily accessed. This is especially true for online data sources. For example, Twitter lets people quickly download millions of data points. When data are plentiful, we can start an analysis by obtaining and exploring the data, and then honing a research question. In both situations, most data have missing or unusual values and other anomalies that we need to account for. No matter the source, we need to check the data quality. Considering the scope of the data is equally important; for example, we identify how representative the data are and look for potential sources of bias in the collection process. These considerations help us determine how much faith we can place in our findings. And, typically, we must manipulate the data before we can analyze them more formally. We may need to modify structure, clean data values, and transform measurements to prepare for analysis.

Understand the Data

After obtaining and preparing data, we want to carefully examine them, and exploratory data analysis is often key. In our explorations, we make plots to uncover interesting patterns and summarize the data visually. We also continue to look for problems with the data. As we search for patterns and trends, we use summary statistics and build statistical models, like linear and logistic regression. In our experience, this stage of the lifecycle is highly iterative. Understanding the data can also lead us back to earlier stages in the data science lifecycle. We may find that we need to modify or redo the data cleaning and manipulation, acquire more data to supplement our analysis, or refine our research question given the limitations of the data. The descriptive and exploratory analyses that we carry out in this stage may adequately answer our question, or we may need to go on to the next stage in order to make generalizations beyond our data.

Understand the World

When our goals are purely descriptive or exploratory, the analysis ends at the Understand the Data stage of the lifecycle. At other times, we aim to quantify how well the trends we find generalize beyond our data. We may want to use a model that we have fit to our data to make inferences about the world or give predictions for future observations. To draw inferences from a sample to a population, we use statistical techniques like A/B testing and confidence intervals. And to make predictions for future observations, we create prediction intervals and use train-test splits of the data.
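As a rough illustration of two of the techniques named in this stage, the sketch below holds out part of a dataset as a test set and computes a percentile bootstrap confidence interval for a mean. It is only a sketch under stated assumptions: the data are synthetic draws from a normal distribution, the helper names `train_test_split` and `bootstrap_ci_mean` are hypothetical, and the bootstrap is just one of several ways to build a confidence interval.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def bootstrap_ci_mean(values, level=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1 - level
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Synthetic measurements (an assumption for illustration, not real data).
data_rng = random.Random(42)
sample = [data_rng.gauss(50, 10) for _ in range(200)]

train, test = train_test_split(sample)
low, high = bootstrap_ci_mean(train)
print(f"train size: {len(train)}, test size: {len(test)}")
print(f"95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

The held-out `test` list plays the role of future observations: a model would be fit on `train` only and then evaluated on `test`.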

For each stage of the lifecycle, we explain theoretical concepts, introduce data technologies and statistical methodologies, and show how they work in practical examples. Throughout, we rely on authentic data and analyses by other data scientists, not made-up data, so you can learn how to perform your own data acquisition, cleaning, exploration, and formal analyses, and draw sound conclusions. Each chapter in this book tends to focus on one stage of the data science lifecycle, but we also include chapters with case studies that demonstrate the full lifecycle.

Note

Understanding the differences between exploration, inference, prediction, and causation can be a challenge. We can easily slip into confusing a correlation found in data with a causal relationship. For example, an exploratory or inferential analysis might look for correlations in response to the question “Do people who have a greater exposure to air pollution have a higher rate of lung disease?” A causal question, by contrast, might ask “Does giving an award to a Wikipedia contributor increase productivity?” We typically cannot answer causal questions unless we have a randomized experiment or a study design that approximates one. We point out these important distinctions throughout the book.

Examples of the Lifecycle

Several case studies that address the entire data science lifecycle are placed throughout this book. These cases serve double duty. They focus on one stage in the lifecycle to provide a specific example of the topics in the part of the book where they are located, and they also demonstrate the entire cycle.

The focus of Chapter 5 is on the interplay between a question of interest and how data can be used to answer the question. The simple question “Why is my bus always late?” provides a rich case study that is basic enough for the beginning data scientist to track the stages of the lifecycle, and yet nuanced enough to demonstrate how we apply both statistical and computational thinking to answer the question. In this case study, we build a simulation study to inform us about the distribution of wait times for riders. And we fit a simple model to summarize the wait times with a statistic. This case study also demonstrates how, as a data scientist, you can collect your own data to answer questions that interest you.
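To give a flavor of what a simulation study of wait times can look like, here is a minimal sketch. It is not the analysis from the case study: the 10-minute headway, the normally distributed lateness, the uniformly arriving riders, and the function name `simulate_wait_times` are all assumptions made for illustration.

```python
import bisect
import random
import statistics


def simulate_wait_times(n_riders=10_000, headway=10.0, lateness_sd=3.0,
                        hours=24, seed=0):
    """Simulate how long riders wait for the next bus, in minutes."""
    rng = random.Random(seed)
    n_buses = int(hours * 60 / headway)
    # Actual arrival = scheduled time + random lateness (negative means early).
    bus_times = sorted(
        i * headway + rng.gauss(0, lateness_sd) for i in range(n_buses + 1)
    )
    waits = []
    for _ in range(n_riders):
        # A rider shows up at a uniformly random time while buses are running.
        t = rng.uniform(bus_times[0], bus_times[-1])
        # Find the first bus that arrives at or after the rider.
        next_bus = bus_times[bisect.bisect_left(bus_times, t)]
        waits.append(next_bus - t)
    return waits


waits = simulate_wait_times()
# Summarize the simulated distribution with simple statistics.
print(f"mean wait:   {statistics.fmean(waits):.2f} minutes")
print(f"median wait: {statistics.median(waits):.2f} minutes")
```

Comparing these summaries with half the headway, the wait we might naively expect when buses run exactly on schedule, shows how irregular arrivals change the rider's experience.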

Chapter 12 studies the accuracy of mass-market air sensors that are used across the United States. We devise a way to leverage data from highly accurate sensors maintained by the Environmental Protection Agency to improve readings from less expensive sensors. This case study shows how crowdsourced, open data can be improved with data from rigorously maintained, precise, government-monitored equipment. In the process, we focus on cleaning and merging data from multiple sources, but we also fit models to adjust and improve air quality measurements.

In Chapter 18 our focus is on model building and prediction. But we cover the full lifecycle and see how the question of interest impacts the model that we build. Our aim is to enable veterinarians in rural Kenya, who have no access to a scale to weigh a donkey, to prescribe medication for a sick animal. As we learn about the design of the study, clean the data, and balance simplicity with accuracy, we assess the predictive capabilities of our model and show how scientists can partner with people facing practical problems and assist them with solutions.

Finally, in Chapter 21 we examine hand-classified news stories in an effort to algorithmically differentiate fake news from real news. In this case study, we again see how readily accessible information creates amazing opportunities for data scientists to develop new technologies and investigate today’s important problems. These data have been scraped from news stories on the web and classified as fake or real news by people reading the stories. We also see how data scientists thinking creatively can take general information, such as the content of a news article, and transform it into analyzable data to address topical questions.

Summary

The data science lifecycle provides an organizing structure for this book. We keep the lifecycle in mind as we work with many datasets from a wide range of sources, including science, medicine, politics, social media, and government. The first time we use a dataset, we provide the context in which the data were collected, the question of interest in examining the data, and descriptions needed to understand the data. In this way, we aim to practice good data science throughout the book.

Books often present the first stage of the lifecycle, asking a question, as applying a technique to get a number, such as “What’s the p-value for this A/B test?” In practice, questions often start out vague, like “Can we restore the American Dream?” Answering the first sort of question gives little practice in developing a research question. Answering the second is hard to do without guidance on how to turn a general area of interest into a question that can be answered with data. The interplay between asking a question and understanding the limitations of data to answer it is the topic of the next chapter.
