Chapter 4. Missing Data
We need to deal with missing data. The previous chapter showed an example. This chapter will dive into it a bit more. Most algorithms will not work if data is missing. Notable exceptions are the recent boosting libraries: XGBoost, CatBoost, and LightGBM.
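For instance, here is a minimal sketch (the toy arrays are hypothetical, assuming the xgboost package is installed) showing that XGBoost will train on data containing NaN values:

>>> import numpy as np
>>> import xgboost as xgb
>>> X = np.array([[1.0, np.nan], [2.0, 3.0],
...               [np.nan, 4.0], [5.0, 6.0]])
>>> y = np.array([0, 1, 0, 1])
>>> model = xgb.XGBClassifier().fit(X, y)  # NaN handled natively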
As with many things in machine learning, there are no hard answers for how to treat missing data. Also, missing data could represent different situations. Imagine census data coming back and an age feature being reported as missing. Is it because the sample didn’t want to reveal their age? They didn’t know their age? The one asking the questions forgot to even ask about age? Is there a pattern to missing ages? Does it correlate to another feature? Is it completely random?
There are also various ways to handle missing data:
- Remove any row with missing data
- Remove any column with missing data
- Impute missing values
- Create an indicator column to signify data was missing
Examining Missing Data
Let’s go back to the Titanic data. Because Python treats True and False as 1 and 0, respectively, we can use this trick in pandas to get the percent of missing data:
>>> df.isnull().mean() * 100
pclass        0.000000
survived      0.000000
name          0.000000
sex           0.000000
age          20.091673
sibsp         0.000000
parch         0.000000
ticket        0.000000
fare          0.076394
cabin        77.463713
embarked      0.152788
boat         62.872422
body         90.756303
home.dest    43.086325
dtype: float64
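If you want raw counts rather than percentages, sum instead of taking the mean:

>>> df.isnull().sum()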
To visualize patterns in the missing data, use the missingno library. This library is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random (see Figure 4-1). The matrix function includes a sparkline along the right side. Patterns here would also indicate nonrandom missing data. You may need to limit the number of samples to be able to see the patterns:
>>> import missingno as msno
>>> ax = msno.matrix(orig_df.sample(500))
>>> ax.get_figure().savefig("images/mlpr_0401.png")
We can create a bar plot of the fraction of data present in each column using pandas (see Figure 4-2):
>>> fig, ax = plt.subplots(figsize=(6, 4))
>>> (1 - df.isnull().mean()).abs().plot.bar(ax=ax)
>>> fig.savefig("images/mlpr_0402.png", dpi=300)
Or use the missingno library to create the same plot (see Figure 4-3):
>>> ax = msno.bar(orig_df.sample(500))
>>> ax.get_figure().savefig("images/mlpr_0403.png")
We can create a heat map showing if there are correlations where data is missing (see Figure 4-4). In this case, it doesn’t look like the locations where data are missing are correlated:
>>> ax = msno.heatmap(df, figsize=(6, 6))
>>> ax.get_figure().savefig("images/mlpr_0404.png")
We can create a dendrogram showing the clusterings of where data is missing (see Figure 4-5). Leaves that are at the same level predict one another’s presence (empty or filled). The vertical arms indicate how different the clusters are; short arms mean that branches are similar:
>>> ax = msno.dendrogram(df)
>>> ax.get_figure().savefig("images/mlpr_0405.png")
Dropping Missing Data
The pandas library can drop all rows with missing data with the .dropna method:
>>> df1 = df.dropna()
To drop columns, we can note what columns are missing and use the .drop method. We can pass in a list of column names or a single column name:
>>> df1 = df.drop(columns="cabin")
Alternatively, we can use the .dropna method and set axis=1 (drop along the column axis):
>>> df1 = df.dropna(axis=1)
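The .dropna method also supports finer-grained dropping. As a sketch, the subset parameter limits which columns are considered, and thresh keeps rows with at least that many non-null values:

>>> df1 = df.dropna(subset=["age"])  # drop rows missing age only
>>> df1 = df.dropna(thresh=10)  # keep rows with >= 10 non-null values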
Be careful about dropping data. I typically view this as a last resort option.
Imputing Data
Once you have a tool for predicting data, you can use it to predict missing data. The general task of supplying values for missing data is called imputation.
If you are imputing data, you will need to build up a pipeline and use the same imputation logic during model creation and prediction time. The SimpleImputer class in scikit-learn will handle mean, median, and most frequent feature values. The default behavior is to calculate the mean:
>>> from sklearn.impute import SimpleImputer
>>> num_cols = df.select_dtypes(
...     include="number"
... ).columns
>>> im = SimpleImputer()  # mean
>>> imputed = im.fit_transform(df[num_cols])
Provide strategy='median' or strategy='most_frequent' to change the replaced value to median or most common, respectively. If you wish to fill with a constant value, say -1, use strategy='constant' in combination with fill_value=-1.
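For example, a minimal sketch of the constant strategy (reusing the num_cols defined above):

>>> im = SimpleImputer(strategy="constant", fill_value=-1)
>>> imputed = im.fit_transform(df[num_cols])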
Tip
You can use the .fillna method in pandas to impute missing values as well. Make sure that you do not leak data, though. If you are filling in with the mean value, make sure you use the same mean value during model creation and model prediction time.
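A minimal sketch of that discipline (train_df and test_df are hypothetical names for a train/test split):

>>> means = train_df[num_cols].mean()  # computed on training data only
>>> train_df = train_df.fillna(means)
>>> test_df = test_df.fillna(means)  # reuse the training means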
The most frequent and constant strategies may be used with numeric or string data. The mean and median require numeric data.
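As a sketch, here is the most frequent strategy applied to a string column (the embarked column has a few missing values):

>>> im = SimpleImputer(strategy="most_frequent")
>>> imputed = im.fit_transform(df[["embarked"]])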
The fancyimpute library implements many algorithms and follows the scikit-learn interface. Sadly, most of the algorithms are transductive, meaning that you can’t call the .transform method by itself after fitting the algorithm. The IterativeImputer is inductive (it has since been migrated from fancyimpute to scikit-learn) and supports transforming after fitting.
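A minimal sketch of the scikit-learn version (recent scikit-learn releases require the experimental enabling import first):

>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
>>> ii = IterativeImputer()
>>> imputed = ii.fit_transform(df[num_cols])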
Adding Indicator Columns
The lack of data in and of itself may provide some signal to a model. The pandas library can add a new column to indicate that a value was missing:
>>> def add_indicator(col):
...     def wrapper(df):
...         return df[col].isna().astype(int)
...
...     return wrapper
>>> df1 = df.assign(
...     cabin_missing=add_indicator("cabin")
... )
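If you are working inside a scikit-learn pipeline instead, SimpleImputer accepts an add_indicator parameter that appends indicator columns to its output. A minimal sketch:

>>> im = SimpleImputer(add_indicator=True)
>>> imputed = im.fit_transform(df[num_cols])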