Chapter 6. Anomaly Detection

An anomaly is something that is different from other members of the same group. In data, an anomaly is a record, an observation, or a value that differs from the remaining data points in a way that raises concerns or suspicions. Anomalies go by a number of different names, including outliers, novelties, noise, deviations, and exceptions. I’ll use the terms anomaly and outlier interchangeably throughout this chapter, and you may see the other terms used in discussions of this topic as well. Anomaly detection can be the end goal of an analysis or a step within a broader analysis project.

Anomalies typically have one of two sources: real events that are extreme or otherwise unusual, or errors introduced during data collection or processing. While many of the steps used to detect outliers are the same regardless of the source, how we choose to handle a particular anomaly depends on the root cause. As a result, understanding the root cause and distinguishing between the two types of causes are important to the analysis process.

Real events can generate outliers for a variety of reasons. Anomalous data can signal fraud, network intrusion, structural defects in a product, loopholes in policies, or product use that wasn’t intended or envisioned by the developers. Anomaly detection is widely used to root out financial fraud, and cybersecurity also makes use of this type of analysis. Sometimes anomalous data results not because a bad actor is trying ...
