Data Smart: Using Data Science to Transform Information into Insight

by John W. Foreman
November 2013
Beginner to intermediate
432 pages
10h 39m
English
Wiley

Chapter 9. Outlier Detection: Just Because They're Odd Doesn't Mean They're Unimportant

Outliers are the odd points in a dataset—the ones that don't fit somehow. Historically, that's meant extreme values, meaning quantities that were either too large or small to have come naturally from the same process as the other observations in the dataset.

The only reason people used to care about outliers was that they wanted to get rid of them. Statisticians a hundred years ago had a lot in common with the Borg: a data point needed to assimilate or die. In the statistician's case, though, this was done with good reason: outliers can move averages and mess with spread measurements in the data. A good example of outlier removal is in gymnastics, where the highest and lowest judges' scores are always trimmed from the data before taking the average score.
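To make the gymnastics example concrete, here is a minimal Python sketch of that kind of trimming. The function name `trimmed_mean` and the panel of scores are made up for illustration; they are not taken from the book or from any real competition, and real scoring rules may differ in the details.

```python
from statistics import mean


def trimmed_mean(scores):
    """Average the scores after dropping one highest and one lowest value."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    # Slice off the first (lowest) and last (highest) entries.
    return mean(ordered[1:-1])


# Hypothetical judges' scores; the 6.0 is the odd one out.
scores = [9.0, 9.5, 9.0, 8.5, 6.0]

plain = mean(scores)            # every score counts, including the 6.0
trimmed = trimmed_mean(scores)  # the 6.0 and the 9.5 are dropped first

print(f"plain mean:   {plain:.2f}")
print(f"trimmed mean: {trimmed:.2f}")
```

With these made-up numbers, the single low score drags the plain average down, while the trimmed average stays close to what most of the judges gave. That is the sense in which outliers move averages.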

Outliers have a knack for messing up machine learning models. For example, in Chapters 6 and 7 you looked at predicting pregnant customers based on their purchase data. What if a store miscoded some items on the shelves of the pharmacy and was registering multivitamin purchases as folic acid purchases? The customers with those faulty purchase vectors are outliers that shift the relationship of pregnancy-to-folic-acid-purchasing in a way that harms the AI model's understanding.

Once upon a time when I consulted for the government, my company found a water storage facility that the United States had in Dubai that had been valued at billions ...



ISBN: 9781118661468