Outlier Detection: Just Because They're Odd Doesn't Mean They're Unimportant

Outliers are the odd points in a dataset—the ones that don't fit somehow. Historically, that's meant extreme values, meaning quantities that were either too large or small to have come naturally from the same process as the other observations in the dataset.

The only reason people used to care about outliers was that they wanted to get rid of them. Statisticians a hundred years ago had a lot in common with the Borg: a data point needed to assimilate or die. In the statistician's case, though, this was done with good reason: outliers can shift averages and distort measures of spread in the data. A good example of outlier removal is in gymnastics, where the highest and lowest judges' scores are trimmed from the data before taking the average score.
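To make the gymnastics idea concrete, here is a minimal Python sketch of that kind of trimming. The scores are invented for illustration, and `trimmed_mean` is a hypothetical helper name, not something from a scoring rulebook. It drops one highest and one lowest score and averages the rest, and it prints the plain mean and standard deviation alongside so you can see how much a single odd score moves them.

```python
from statistics import mean, stdev


def trimmed_mean(scores):
    """Drop one highest and one lowest score, then average what is left."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    return mean(ordered[1:-1])


# Hypothetical judges' scores (made up for this example); 6.0 is the odd one out.
scores = [9.1, 9.0, 9.2, 9.1, 6.0]

# The plain average and the spread are both dragged around by the 6.0.
print("mean:         ", round(mean(scores), 3))
print("std deviation:", round(stdev(scores), 3))

# Trimming the extremes gives a figure closer to what most judges saw.
print("trimmed mean: ", round(trimmed_mean(scores), 3))
```

Note that trimming throws away the highest score as well as the lowest, whether or not either one is actually suspicious. That is the "assimilate or die" approach in miniature.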

Outliers have a knack for messing up machine learning models. For example, in Chapters 6 and 7 you looked at predicting pregnant customers based on their purchase data. What if a store miscoded some items on the shelves of the pharmacy and was registering multi-vitamin purchases as folic acid purchases? The customers with those faulty purchase vectors are outliers that shift the relationship between pregnancy and folic acid purchasing in a way that harms the AI model's understanding.

Once upon a time when I consulted for the government, my company found a water storage facility that the United States had in Dubai that had been valued at billions ...
