Chapter 1. Introduction
Wouldn't it be amazing to have a system that warned you about new behaviors and data patterns in time to fix problems before they happened, or seize opportunities the moment they arise? Wouldn't it be incredible if this system was completely foolproof, warning you about every important change, but never ringing the alarm bell when it shouldn't? That system is the holy grail of anomaly detection. It doesn't exist, and probably never will. However, we shouldn't let imperfection make us lose sight of the fact that useful anomaly detection is possible, and benefits those who apply it appropriately.
Anomaly detection is a set of techniques and systems to find unusual behaviors and/or states in systems and their observable signals. We hope that people who read this book do so because they believe in the promise of anomaly detection, but are confused by the furious debates in thought-leadership circles surrounding the topic. We intend this book to help demystify the topic and clarify some of the fundamental choices that have to be made in constructing anomaly detection mechanisms. We want readers to understand why some approaches to anomaly detection work better than others in some situations, and why a better solution for some challenges may be within reach after all.
This book is not intended to be a comprehensive source for all information on the subject. That book would be 1000 pages long and would be incomplete at that. It is also not intended to be a step-by-step guide to building an anomaly detection system that will work well for all applications; we're pretty sure that a "general solution" to anomaly detection is impossible. We believe the best approach for a given situation is dependent on many factors, not least of which is the cost/benefit analysis of building more complex systems. We hope this book will help you navigate the labyrinth by outlining the tradeoffs associated with different approaches to anomaly detection, which will help you make judgments as you reach forks in the road.
We decided to write this book after several years of work applying anomaly detection to our own problems in monitoring and related use cases. Both of us work at VividCortex on a large-scale, specialized form of database monitoring. At VividCortex, we have flexed our anomaly detection muscles in a number of ways. We have built, and more importantly discarded, dozens of anomaly detectors over the last several years. We were also working on anomaly detection in monitoring systems even before VividCortex. We have tried statistical, heuristic, machine learning, and other techniques.
We have also engaged with our peers in monitoring, DevOps, anomaly detection, and a variety of other disciplines. We have developed a deep and abiding respect for many people, projects, products, and companies, Ruxit among them. We have tried to share our challenges, successes, and failures through blogs, open-source software, conference talks, and now this book.
Why Anomaly Detection?
Monitoring, the practice of observing systems and determining if they're healthy, is hard and getting harder. There are many reasons for this: we are managing many more systems (servers and applications or services) and much more data than ever before, and we are monitoring them in higher resolution. Companies such as Etsy have convinced the community that it is not only possible but desirable to monitor practically everything we can, so we are also monitoring many more signals from these systems than we used to.
Any of these changes presents a challenge, but collectively they present a very difficult one indeed. As a result, we now struggle to make sense of all of these metrics.
Traditional ways of monitoring all of these metrics can no longer do the job adequately. There is simply too much data to monitor.
Many of us are used to monitoring visually by actually watching charts on the computer or on the wall, or using thresholds with systems like Nagios. Thresholds actually represent one of the main reasons that monitoring is too hard to do effectively. Thresholds, put simply, don't work very well. Setting a threshold on a metric requires a system administrator or DevOps practitioner to make a decision about the correct value to configure.
The problem is, there is no correct value. A static threshold is just that: static. It does not change over time, and by default it is applied uniformly to all servers. But systems are neither similar nor static. Each system is different from every other, and even individual systems change, both over the long term, and hour to hour or minute to minute.
The result is that thresholds are too much work to set up and maintain, and cause too many false alarms and missed alarms. False alarms, because normal behavior is flagged as a problem, and missed alarms, because the threshold is set at a level that fails to catch a problem.
You may not realize it, but threshold-based monitoring is actually a crude form of anomaly detection. When the metric crosses the threshold and triggers an alert, it's really flagging the value of the metric as anomalous. The root of the problem is that this form of anomaly detection cannot adapt to the system's unique and changing behavior. It cannot learn what is normal.
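To make the false-alarm and missed-alarm problem concrete, here is a minimal sketch in Python. The two servers, their sample values, and the threshold of 80 are all invented for illustration; they are assumptions, not measurements from any real system.

```python
# Hypothetical CPU-utilization samples (percent) for two servers.
# All numbers are invented for illustration only.
busy_server = [82, 85, 88, 84, 86, 83, 87, 85]   # normally runs hot
quiet_server = [10, 12, 11, 9, 10, 55, 58, 11]   # normally idle; 55 and 58 are unusual

STATIC_THRESHOLD = 80  # one assumed value, applied uniformly to every server


def static_alerts(samples, threshold=STATIC_THRESHOLD):
    """Return the indexes of the samples that exceed the fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Every sample on the busy server alerts, even though this is its normal level.
print(static_alerts(busy_server))
# Nothing alerts on the quiet server, even though two samples are far from its norm.
print(static_alerts(quiet_server))
```

The same fixed value produces false alarms on one server and missed alarms on the other, because it knows nothing about what is normal for either of them.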
Another way you are already using anomaly detection techniques is with features such as Nagios's flapping suppression, which disallows alarms when a check's result oscillates between states. This is a crude form of a low-pass filter, a signal-processing technique to discard noise. It works, but not all that well because its idea of noise is not very sophisticated.
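The idea behind suppressing alarms on an oscillating check can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Nagios's actual algorithm; the function names, the state labels, and the 0.5 limit are assumptions chosen for the example.

```python
def state_change_ratio(states):
    """Fraction of consecutive check results that differ (0.0 to 1.0)."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return changes / (len(states) - 1)


def is_flapping(states, limit=0.5):
    """Treat the check as flapping when it changes state too often.

    The limit of 0.5 is an arbitrary, assumed value.
    """
    return state_change_ratio(states) > limit


# Hypothetical histories of recent check results.
oscillating = ["OK", "CRIT", "OK", "CRIT", "OK", "CRIT", "OK"]
real_outage = ["OK", "OK", "OK", "CRIT", "CRIT", "CRIT", "CRIT"]

print(is_flapping(oscillating))  # rapid oscillation: alarms would be suppressed
print(is_flapping(real_outage))  # one sustained change: the alarm goes through
```

Rapid back-and-forth changes are treated as noise and filtered out, while a single sustained change passes through. The crudeness is also visible here: the only notion of noise is how often the state changes within a fixed window.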
A common assumption is that more sophisticated anomaly detection can solve all of these problems. We assume that anomaly detection can help us reduce false alarms and missed alarms. We assume that it can help us find problems more accurately with less work. We assume that it can suppress noisy alerts when systems are in unstable states. We assume that it can learn what is normal for a system, automatically and with zero configuration.
Why do we assume these things? Are they reasonable assumptions? That is one of the goals of this book: to help you understand your assumptions, some of which you may not realize you're making. With explicit assumptions, we believe you will be prepared to make better decisions. You will be able to understand the capabilities and limitations of anomaly detection, and to select the right tool for the task at hand.
The Many Kinds of Anomaly Detection
Anomaly detection is a complicated subject. You might understand this already, but nevertheless it is probably still more complicated than you believe. There are many kinds of anomaly detection techniques. Each technique has a dizzying number of variations. Each of these is suitable, or unsuitable, for use in a number of scenarios. Each of them has a number of edge cases that can cause poor results. And many of them are based on advanced math, statistics, or other disciplines that are beyond the reach of most of us.
Still, there are lots of success stories for anomaly detection in general. In fact, as a profession, we are late in applying anomaly detection on a large scale to monitoring. It certainly has been done, but if you look at other professions, various types of anomaly detection are standard practice. This applies to domains such as credit card fraud detection, monitoring for terrorist activity, finance, weather, gambling, and many more too numerous to mention. In contrast, in systems monitoring we generally do not regard anomaly detection as a standard practice, but rather as something potentially promising but leading edge.
The authors of this book agree with this assessment, by and large. We also see a number of obstacles to be overcome before anomaly detection is regarded as a standard part of the monitoring toolkit:
- It is difficult to get started, because there's so much to learn before you can even start to get results.
- Even if you do a lot of work and the results seem promising, when you deploy something into production you can find poor results often enough that nothing usable comes of your efforts.
- General-purpose solutions are either impossible or extremely difficult to achieve in many domains. This is partially because of the incredible diversity of machine data. There are also apparently an almost infinite number of edge cases and potholes that can trip you up. In many of these cases, things appear to work well even when they really don't, or they accidentally work well, leading you to think that it is by design. In other words, whether something is actually working or not is a very subtle thing to determine.
- There seems to be an unlimited supply of poor and incomplete information to be found on the Internet and in other sources. Some of it is probably even in this book.
- Anomaly detection is such a trendy topic, and it is currently so cool and thought-leadery to write or talk about it, that there seem to be incentives for adding insult to the already injurious amount of poor information just mentioned.
- Many of the methods are based on statistics and probability, both of which are incredibly unintuitive, and often have surprising outcomes. In the authors' experience, few things can lead you astray more quickly than applying intuition to statistics.
As a result, anomaly detection seems to be a topic that is all about extremes. Some people try it, or observe other people's efforts and results, and conclude that it is impossible or difficult. They give up hope. This is one extreme. At the other extreme, some people find good results, or believe they have found good results, at least in some specific scenario. They mistakenly think they have found a general purpose solution that will work in many more scenarios, and they evangelize it a little too much. This overenthusiasm can result in negative press and vilification from other people. Thus, we seem to veer between holy grails and despondency. Each extreme is actually an overcorrection that feeds back into the cycle.
Sadly, none of this does much to educate people about the true nature and benefits of anomaly detection. One outcome is that a lot of people are missing out on benefits that they could be getting. Another is that they may not be informed enough to have realistic opinions about commercially available anomaly detection solutions. As Zen Master Hakuin said,
Not knowing how near the truth is, we seek it far away.
Conclusions
If you are like most of our friends in the DevOps and web operations communities, you probably picked up this book because you've been hearing a lot about anomaly detection in the last few years, and you're intrigued by it. In addition to the previously mentioned goal of making assumptions explicit, we hope to achieve a number of outcomes in this book.
- We want to help orient you to the subject and the landscape in general. We want you to have a frame of reference for thinking about anomaly detection, so you can make your own decisions.
- We want to help you understand how to assess not only the meaning of the answers you get from anomaly detection algorithms, but how trustworthy the answers might be.
- We want to teach you some things that you can actually apply to your own systems and your own problems. We don't want this to be just a bunch of theory. We want you to put it into practice.
- We want your time spent reading this book to be useful beyond this book. We want you to be able to apply what you have learned to topics we don't cover in this book.
If you already know anything about anomaly detection, statistics, or any of the other things we cover in this book, you're going to see that we omit or gloss over a lot of important information. That is inevitable. From prior experience, we have learned that it is better to help people form useful thought processes and mental models than to tell them what to think.
As a result of this, we hope you will be able to combine the material in this book with your existing tools and skills to solve problems on your systems. By and large, we want you to get better at what you already do, and learn a new trick or two, rather than solving world hunger. If you ask, "what can I do that's a little better than Nagios?" you're on the right track.
Anomaly detection is not a black and white topic. There is a lot of gray area, a lot of middle ground. Despite the complexity and richness of the subject matter, it is both fun and productive. And despite the difficulty, there is a lot of promise for applying it in practice.
Somewhere between static thresholds and magic, there is a happy medium. In this book, we strive to help you find that balance, while avoiding some of the sharp edges.