Handling outliers

Your data will often have outlying values, or data points that are far away from the expected value for your dataset. Sometimes, outliers are caused by noise or errors (somebody recording a height of 7'3" rather than 6'3"), but other times, outliers are legitimate data points (one celebrity with a Twitter reach of 10 million followers joining your service where most of the users have 10,000 to 100,000 followers). In either case, you'll first want to identify outliers so that you can determine what to do with them.

One approach to identifying outliers is to calculate the mean and standard deviation of your dataset, and determine how many standard deviations away from the mean each data point is. The standard deviation of ...
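The mean-and-standard-deviation approach described above can be sketched in a few lines of plain JavaScript. This is a minimal illustration rather than the book's own implementation: the function names (`mean`, `standardDeviation`, `zScores`, `findOutliers`), the use of the population standard deviation, and the default threshold of 3 are all assumptions made for this example.

```javascript
// Arithmetic mean of a numeric array.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (divides by N rather than N - 1).
// Assumption: the text does not say which variant it uses.
function standardDeviation(values) {
  const mu = mean(values);
  const variance = mean(values.map(v => (v - mu) ** 2));
  return Math.sqrt(variance);
}

// Number of standard deviations each point lies from the mean (its z-score).
// If every value is identical, sigma is 0 and every score is reported as 0.
function zScores(values) {
  const mu = mean(values);
  const sigma = standardDeviation(values);
  return values.map(v => (sigma === 0 ? 0 : (v - mu) / sigma));
}

// Returns the values whose absolute z-score exceeds the threshold.
// The default of 3 is a common rule of thumb, not a value from the text.
function findOutliers(values, threshold = 3) {
  const scores = zScores(values);
  return values.filter((_, i) => Math.abs(scores[i]) > threshold);
}

// Hypothetical follower counts: five typical users and one celebrity.
const followers = [10000, 20000, 50000, 80000, 100000, 10000000];
const flagged = findOutliers(followers, 2); // [10000000]
```

The usage example passes a threshold of 2 rather than 3 because the sample is tiny: with only six points and the population standard deviation, no z-score can exceed roughly 2.04, so a threshold of 3 would never flag anything. On larger datasets a higher threshold is usually more appropriate, and the right value depends on the data.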
