Chapter 4. Missing Data

We need to deal with missing data. The previous chapter showed an example; this chapter dives into the topic in more detail. Most algorithms will not work if data is missing. Notable exceptions are the recent boosting libraries: XGBoost, CatBoost, and LightGBM.

As with many things in machine learning, there are no hard answers for how to treat missing data. Also, missing data can represent different situations. Imagine census data coming back with an age feature reported as missing. Is it because the respondent didn’t want to reveal their age? They didn’t know their age? The person asking the questions forgot to ask about age? Is there a pattern to the missing ages? Does it correlate with another feature? Is it completely random?

There are also various ways to handle missing data:

  • Remove any row with missing data

  • Remove any column with missing data

  • Impute missing values

  • Create an indicator column to signify data was missing

Examining Missing Data

Let’s go back to the Titanic data. Because Python treats True and False as 1 and 0, respectively, we can use this trick in pandas to get the percentage of missing data per column:

>>> df.isnull().mean() * 100
pclass        0.000000
survived      0.000000
name          0.000000
sex           0.000000
age          20.091673
sibsp         0.000000
parch         0.000000
ticket        0.000000
fare          0.076394
cabin        77.463713
embarked      0.152788
boat         62.872422
body         90.756303
home.dest    43.086325
dtype: float64

To visualize patterns in the missing data, use the missingno library. This library is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random (see Figure 4-1). The matrix function includes a sparkline along the right side. Patterns here would also indicate nonrandom missing data. You may need to limit the number of samples to be able to see the patterns:

>>> import missingno as msno
>>> ax = msno.matrix(orig_df.sample(500))
>>> ax.get_figure().savefig("images/mlpr_0401.png")
Figure 4-1. Where data is missing. No clear patterns jump out to the author.

We can create a bar plot of missing data counts using pandas (see Figure 4-2):

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(6, 4))
>>> (1 - df.isnull().mean()).plot.bar(ax=ax)
>>> fig.savefig("images/mlpr_0402.png", dpi=300)
Figure 4-2. Percents of nonmissing data with pandas. Boat and body are leaky so we should ignore those. Interesting that some ages are missing.

Or use the missingno library to create the same plot (see Figure 4-3):

>>> ax = msno.bar(orig_df.sample(500))
>>> ax.get_figure().savefig("images/mlpr_0403.png")
Figure 4-3. Percents of nonmissing data with missingno.

We can create a heat map showing whether there are correlations in where data is missing (see Figure 4-4). In this case, it doesn’t look like the locations of missing data are correlated:

>>> ax = msno.heatmap(df, figsize=(6, 6))
>>> ax.get_figure().savefig("images/mlpr_0404.png")
Figure 4-4. Correlations of missing data with missingno.

We can create a dendrogram showing the clusterings of where data is missing (see Figure 4-5). Leaves at the same level predict one another’s presence (empty or filled). The vertical arms indicate how different the clusters are; short arms mean that branches are similar:

>>> ax = msno.dendrogram(df)
>>> ax.get_figure().savefig("images/mlpr_0405.png")
Figure 4-5. Dendrogram of missing data with missingno. We can see the columns without missing data on the upper right.

Dropping Missing Data

The pandas library can drop all rows with missing data with the .dropna method:

>>> df1 = df.dropna()

To drop columns, we can note which columns have missing values and use the .drop method. We can pass in a single column name or a list of column names:

>>> df1 = df.drop(columns="cabin")

Alternatively, we can use the .dropna method and set axis=1 (drop along the column axis):

>>> df1 = df.dropna(axis=1)
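
A middle ground is to drop only the columns that are mostly empty. Here is a minimal sketch using plain pandas, keeping columns that are less than 90% missing (the 0.9 threshold is an arbitrary value for illustration, not a recommendation):

>>> # keep columns that are less than 90% missing
>>> df1 = df.loc[:, df.isnull().mean() < 0.9]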

Be careful about dropping data. I typically view it as a last-resort option.

Imputing Data

Once you have a tool for predicting values, you can use it to predict missing data. The general task of filling in missing values is called imputation.

If you are imputing data, you will need to build up a pipeline and use the same imputation logic during model creation and prediction time. The SimpleImputer class in scikit-learn will handle mean, median, and most frequent feature values.

The default behavior is to calculate the mean:

>>> from sklearn.impute import SimpleImputer
>>> num_cols = df.select_dtypes(
...     include="number"
... ).columns
>>> im = SimpleImputer()  # mean
>>> imputed = im.fit_transform(df[num_cols])
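
Because the imputer is fit once, the means it learned from the training data can be reused on new data at prediction time. A sketch, where test_df is a hypothetical holdout DataFrame with the same columns:

>>> # applies the means learned during .fit
>>> imputed_test = im.transform(test_df[num_cols])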

Provide strategy='median' or strategy='most_frequent' to change the replaced value to median or most common, respectively. If you wish to fill with a constant value, say -1, use strategy='constant' in combination with fill_value=-1.
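
As a sketch, here are both alternatives applied to the same numeric columns as above:

>>> im_med = SimpleImputer(strategy="median")
>>> imputed = im_med.fit_transform(df[num_cols])
>>> im_const = SimpleImputer(
...     strategy="constant", fill_value=-1
... )
>>> imputed = im_const.fit_transform(df[num_cols])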

Tip

You can use the .fillna method in pandas to impute missing values as well. Make sure that you do not leak data though. If you are filling in with the mean value, make sure you use the same mean value during model creation and model prediction time.
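
For example, here is a sketch that computes the mean on the training set only and reuses it on the test set (train_df and test_df are hypothetical splits of the data):

>>> # compute the mean from training data only
>>> mean_age = train_df.age.mean()
>>> train_df["age"] = train_df.age.fillna(mean_age)
>>> # reuse the training mean to avoid leaking
>>> test_df["age"] = test_df.age.fillna(mean_age)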

The most frequent and constant strategies may be used with numeric or string data. The mean and median require numeric data.

The fancyimpute library implements many algorithms and follows the scikit-learn interface. Sadly, most of its algorithms are transductive, meaning that you can’t call the .transform method on new data after fitting. The IterativeImputer class (which originated in fancyimpute but has since been migrated into scikit-learn) is inductive and supports transforming after fitting.
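
Because IterativeImputer ships with scikit-learn as an experimental feature, it requires an explicit enabling import. A minimal sketch on the numeric columns from above:

>>> # importing this enables IterativeImputer
>>> from sklearn.experimental import (
...     enable_iterative_imputer,
... )
>>> from sklearn.impute import IterativeImputer
>>> ii = IterativeImputer()
>>> imputed = ii.fit_transform(df[num_cols])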

Adding Indicator Columns

The lack of data in and of itself may provide some signal to a model. The pandas library can add a new column to indicate that a value was missing:

>>> def add_indicator(col):
...     # Return a function that maps a DataFrame to a
...     # 0/1 column: 1 where col is missing, 0 otherwise.
...     def wrapper(df):
...         return df[col].isna().astype(int)
...
...     return wrapper

>>> df1 = df.assign(
...     cabin_missing=add_indicator("cabin")
... )
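
Recent versions of scikit-learn can do the same thing as part of imputation: passing add_indicator=True to SimpleImputer appends an indicator column for each feature that had missing values when it was fit. A sketch using the numeric columns from before:

>>> im = SimpleImputer(add_indicator=True)
>>> # output has the imputed columns followed by
>>> # indicator columns for features with missing data
>>> imputed = im.fit_transform(df[num_cols])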
