Recall that one of our goals for this book is to help you actually get anomaly detection running in production and solving monitoring problems you have with your current systems.
Typical goals for adding anomaly detection probably include:

- To avoid setting or changing thresholds per server, because machines differ from each other
- To avoid modifying thresholds when servers, features, and workloads change over time
- To avoid static thresholds that throw false alerts at some times of the day or week, and miss problems at other times
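To make the contrast with static thresholds concrete, here is a minimal sketch of one common alternative: alerting when a metric strays more than a few standard deviations from its own recent rolling mean. The function name `adaptive_alerts` and the `window` and `k` parameters are illustrative choices, not anything prescribed by a particular monitoring tool.

```python
# Sketch: an adaptive threshold based on a rolling mean and standard
# deviation, instead of one fixed static threshold for every server.
from collections import deque
import math

def adaptive_alerts(values, window=10, k=3.0):
    """Return indices of points more than k rolling standard deviations
    from the rolling mean of the previous `window` points."""
    recent = deque(maxlen=window)
    alerts = []
    for i, v in enumerate(values):
        if len(recent) == window:
            mean = sum(recent) / window
            var = sum((x - mean) ** 2 for x in recent) / window
            std = math.sqrt(var)
            # A flat history (std == 0) means any deviation is anomalous.
            if abs(v - mean) > k * std:
                alerts.append(i)
        recent.append(v)  # the current point becomes part of the history
    return alerts
```

Because the threshold is derived from each metric's own recent behavior, the same code can watch machines with very different baselines, and it adapts as workloads drift, which is exactly what the goals above ask for.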
In general, you can probably sum up these goals as “just make Nagios a little better for some checks.”
Another goal might be to find all abnormal metrics without generating alerts, for use in diagnosing problems. We consider this a pretty hard problem because it is so general; at this point in the book, you probably understand why. We won’t focus on this goal in this chapter, although you can easily apply the discussion here to that approach on a case-by-case basis.
The best place to begin is often where you experience the most painful monitoring problem right now. Take a look at your alert history or outages. What is the source of the most noise, or where do problems occur most often without an alert to notify you?
Not all of the alerting problems you’ll find are solvable with anomaly detection. Some come from alerting ...