Chapter 16. Handling Missing Data
The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.
In this chapter, we will discuss some general considerations for missing data, look at how Pandas chooses to represent it, and explore some built-in Pandas tools for handling missing data in Python. Here and throughout the book, I will refer to missing data in general as null, NaN, or NA values.
Trade-offs in Missing Data Conventions
A number of approaches have been developed to track the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean array, or it might involve appropriation of one bit in the data representation to locally indicate the null status of a value.
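To make the masking idea concrete, here is a minimal sketch in plain Python, with no Pandas or NumPy involved. The data, the names values and is_missing, and the helper masked_mean are all hypothetical, chosen only for illustration. The point is that the stored value at a masked position is irrelevant, because the separate Boolean list is the only thing that says whether an entry is present.

```python
# Hypothetical readings: the entry at index 2 was never recorded.
# The 0.0 stored there is only a placeholder; the mask is what marks it missing.
values = [1.5, 2.0, 0.0, 4.5]
is_missing = [False, False, True, False]  # separate Boolean mask, one flag per value


def masked_mean(values, is_missing):
    """Average only the entries whose mask flag is False."""
    present = [v for v, missing in zip(values, is_missing) if not missing]
    if not present:
        return None  # every entry is masked, so there is nothing to average
    return sum(present) / len(present)


mean_present = masked_mean(values, is_missing)  # ignores the placeholder
naive_mean = sum(values) / len(values)          # wrongly counts the placeholder

print(mean_present, naive_mean)
```

The cost of this design is the extra storage and bookkeeping for the mask itself, which must be kept aligned with the data through every operation.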
In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value that is part of the IEEE floating-point ...
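The two kinds of sentinel described above can be sketched in a few lines of plain Python. The lists and the name SENTINEL are hypothetical and serve only as an illustration. The sketch shows one practical difference between the two conventions: a data-specific sentinel such as -9999 is an ordinary number that must be filtered out by hand, while NaN propagates through arithmetic and never compares equal to itself.

```python
import math

# Data-specific convention (an assumption for this sketch): -9999 marks a missing integer.
SENTINEL = -9999
counts = [12, SENTINEL, 7]
present_counts = [c for c in counts if c != SENTINEL]

# If the sentinel is forgotten, it silently corrupts the result.
corrupted_total = sum(counts)

# Global convention: NaN marks a missing floating-point value.
readings = [0.5, math.nan, 1.5]
present_readings = [r for r in readings if not math.isnan(r)]

# NaN propagates through arithmetic, so the gap stays visible in the result.
nan_total = sum(readings)

# NaN never compares equal to itself, so it must be detected with math.isnan.
nan_equals_itself = math.nan == math.nan

print(present_counts, corrupted_total)
print(present_readings, nan_total, nan_equals_itself)
```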