Chapter 6. Handling Missing Data

Missing data is a common occurrence in data analysis. In the age of big data, many authors and even more practitioners treat it as a minor annoyance that deserves scant thought: just filter out the rows with missing data. If you go from 12 million rows to 11 million, what’s the big deal? That still leaves you with plenty of data to run your analyses.

Unfortunately, filtering out the rows with missing data can introduce significant biases in your analysis. Let’s say that older customers are more likely to have missing data, for example because they are less likely to set up automated payments; by filtering these customers out you would bias your analysis toward younger customers, who would be overrepresented in your filtered data. Other common methods of handling missing data, such as replacing missing values with the average value for that variable, also introduce biases of their own.
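
To make the two biases concrete, here is a minimal Python sketch on made-up data. It is not the AirCnC data set used in this chapter: the variable names, the link between age and spend, and the missingness rates (60% for customers aged 50 and over, 10% otherwise) are all assumptions chosen for illustration. It compares the full simulated population with what you would see after dropping incomplete rows, and after filling the gaps with the observed average.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

N = 20_000
customers = []  # each entry: (age, true_spend, observed_spend or None)
for _ in range(N):
    age = random.uniform(20, 80)
    # Hypothetical relationship: older customers spend more on average.
    spend = 50 + 1.0 * age + random.gauss(0, 10)
    # Assumed missingness: older customers are far more likely to have no value.
    p_missing = 0.6 if age >= 50 else 0.1
    observed = None if random.random() < p_missing else spend
    customers.append((age, spend, observed))

# Ground truth, available only because the data is simulated.
true_mean_age = statistics.fmean([a for a, _, _ in customers])
true_mean_spend = statistics.fmean([s for _, s, _ in customers])
true_sd_spend = statistics.pstdev([s for _, s, _ in customers])

# Approach 1: filter out the rows with missing data.
complete = [(a, o) for a, _, o in customers if o is not None]
cc_mean_age = statistics.fmean([a for a, _ in complete])
cc_mean_spend = statistics.fmean([o for _, o in complete])

# Approach 2: replace missing values with the average of the observed values.
imputed = [o if o is not None else cc_mean_spend for _, _, o in customers]
imputed_mean_spend = statistics.fmean(imputed)
imputed_sd_spend = statistics.pstdev(imputed)

print(f"mean age:   full={true_mean_age:.1f}  filtered={cc_mean_age:.1f}")
print(f"mean spend: full={true_mean_spend:.1f}  filtered={cc_mean_spend:.1f}  "
      f"mean-imputed={imputed_mean_spend:.1f}")
print(f"sd spend:   full={true_sd_spend:.1f}  mean-imputed={imputed_sd_spend:.1f}")
```

Under these assumptions, the filtered sample skews younger and its average spend is too low. Mean imputation restores the row count but keeps that same biased average, and it also shrinks the spread of the variable, because every filled-in value sits exactly at the mean.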

Statisticians and methodologists have developed methods that have much smaller or even no bias. These methods have not been adopted broadly by practitioners yet, but hopefully this chapter will help you get ahead of the curve!

The theory of missing values is rooted in statistics and can easily get very mathematical. To make our journey in this chapter more concrete, we’ll work through a simulated data set for AirCnC. The business context is that the marketing department, in an effort to better understand customer characteristics and motivations, has sent out by email a survey to ...
