Chapter 4. Missing Data
We need to deal with missing data. The previous chapter showed an example; this chapter goes into more detail. Most algorithms will not work if data is missing. Notable exceptions are the recent boosting libraries: XGBoost, CatBoost, and LightGBM.
As with many things in machine learning, there are no hard answers for how to treat missing data. Missing data can also represent different situations. Imagine census data coming back with the age feature reported as missing. Is it because the respondent didn’t want to reveal their age? Did they not know their age? Did the interviewer forget to ask about age? Is there a pattern to the missing ages? Does it correlate with another feature? Is it completely random?
There are also various ways to handle missing data:
- Remove any row with missing data
- Remove any column with missing data
- Impute missing values
- Create an indicator column to signify data was missing
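The four options above can be sketched in pandas. This is a minimal illustration on a hypothetical toy DataFrame (the column values and the name age_missing are made up for the example, not taken from the Titanic data), and median imputation is just one of many possible imputation choices:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic set used in this chapter.
df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0],
        "fare": [7.25, 71.28, np.nan, 26.55],
        "pclass": [3, 1, 3, 2],
    }
)

# 1. Remove any row with missing data.
rows_dropped = df.dropna()

# 2. Remove any column with missing data.
cols_dropped = df.dropna(axis=1)

# 3. Impute missing values (column median here; an assumed choice).
imputed = df.fillna(df.median())

# 4. Create an indicator column to signify data was missing.
with_indicator = df.assign(age_missing=df["age"].isnull().astype(int))
```

Dropping rows or columns discards information, while imputing keeps every row at the cost of inventing values. An indicator column can be combined with imputation so that a model still sees which values were originally missing.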
Examining Missing Data
Let’s go back to the Titanic data. Because Python treats True and False as 1 and 0, respectively, we can use this trick in pandas to get the percentage of missing data in each column:
>>> df.isnull().mean() * 100
pclass        0.000000
survived      0.000000
name          0.000000
sex           0.000000
age          20.091673
sibsp         0.000000
parch         0.000000
ticket        0.000000
fare          0.076394
cabin        77.463713
embarked      0.152788
boat         62.872422
body         90.756303
home.dest    43.086325
dtype: float64
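To see why the mean trick works, here is a small sketch on a hypothetical two-column frame (the name toy and its values are assumptions for illustration only). The Boolean mask from isnull is averaged per column, so each mean is the fraction of missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" has 2 of 4 values missing, "b" has none.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})

# True where a value is missing, False otherwise.
mask = toy.isnull()

# True counts as 1 and False as 0, so the column mean is the
# fraction missing; multiply by 100 for a percentage.
pct_missing = mask.mean() * 100
```

Here pct_missing holds 50.0 for column a and 0.0 for column b.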
To visualize patterns in the missing data, use the missingno library. This library is useful ...