Statistics is an undervalued topic in the world of software engineering and systems administration. It’s also misunderstood: many people I’ve spoken to over the years are operating on the misapprehension that “rubbing a little stats on it” will result in magic coming out the other end. Unfortunately, that isn’t quite the case.
However, I am happy to say that a basic lesson in statistics is both straightforward and incredibly useful to your work in monitoring.
Before we get into the statistics lesson, it’s helpful to understand a bit of the background story.
I fear that the prevalence and influence of Nagios has stifled the improvement of monitoring for many teams. Setting up an alert with Nagios is so simple, yet so often ineffective.1
If you want an alert on some metric with Nagios, you’re effectively comparing the current value against another value you’ve already set as a warning or critical threshold. For example, let’s say the returned value is 5 for the 15m load average. The check script is going to compare that value against the warning value or critical value, which might be 4 and 10, respectively. In this situation, Nagios would fire an alert for the check breaching the warning value, which is expected. Unfortunately, it isn’t very helpful.
As so often happens, systems can behave in unexpected (but totally fine) ways. For example, what if the value crossed the threshold for only one occurrence? What if the next check, 60 seconds later, came back with a value of 3.9? And the one after that was 4.1? As you might imagine, things would get noisy.
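To make the noise concrete, here is a minimal sketch of that style of check in Python. The function name `check_state` is mine, and the strictly-greater-than comparison is an assumption about how a typical check script treats its thresholds; real plugins vary.

```python
def check_state(value, warning, critical):
    """Return a Nagios-style state for one reading against static thresholds."""
    # Assumption: a reading must exceed the threshold to breach it.
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"


# Three consecutive 15m load average readings, 60 seconds apart.
readings = [5, 3.9, 4.1]
states = [check_state(v, warning=4, critical=10) for v in readings]
print(states)  # ['WARNING', 'OK', 'WARNING']
```

Every change of state is a potential notification, even though nothing about the system meaningfully changed between those three readings.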
Nagios and similar tools have built mechanisms to quiet the noise for this particular sort of problem in the form of flapping detection. This works simply and rather naively: the monitoring tool will silence a check that swings back and forth from OK to alerting too many times in a set time period. In my opinion, mechanisms like flap detection just cover for bad alerting. What if there were a better way?
One of the core principles of the modern monitoring stack is to not throw away the metrics the monitoring service gives you. In the old days, Nagios didn’t record the values it received from a check, so you had no idea what trends were, whether last week or five minutes ago. Thankfully, it’s commonplace to record this data in a time series database now, even with Nagios (see Graphios and pnp4nagios). Something often overlooked is that keeping data opens up many new possibilities for problem detection through the use of statistics.
Every major time series database in use supports basic statistics. The configuration and usage differ across each one, so I’m going to spend our time together in this chapter on the statistics themselves, rather than their use in a particular tool.
If you’re used to the Nagios model of running checks, we’ll need to change your thinking just slightly. Instead of having the monitoring system gather the data and check the values against a set threshold at the same time (typical Nagios behavior), let’s decouple those into two separate functions.
We’ll need something to collect the data and write it to the time series database at regular intervals (I’m a huge fan of collectd for this purpose). Separately, we’ll have Nagios run its load average check not against the host directly, but against a metric stored in the time series database. You’ll need to use a different check script for this, one that is built to query your TSDB of choice (see Nagios + Graphite, Sensu + Graphite).
One of the new capabilities with this method is that you don’t have to run the check against just the last reported value anymore—you can run it against a larger number of values. This will allow you to make use of basic arithmetic and statistical functions, leading you to more accurate problem detection. This additional amount of data is fundamental to everything in this chapter, as we can’t tease out insights or predict the future without more of an idea of the past.
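As a rough illustration, a check that looks at several recent values might be sketched like this. The function name `window_breaches` and the five-value window are assumptions for the example, and the list stands in for whatever your TSDB query returns.

```python
from statistics import mean


def window_breaches(values, threshold, window=5):
    """True when the mean of the most recent `window` values exceeds the threshold."""
    recent = values[-window:]
    if not recent:
        return False  # no data yet, so nothing to judge
    return mean(recent) > threshold


# One brief spike among otherwise normal load averages.
brief_spike = [3.6, 5, 3.9, 4.1, 3.2]
# Load that stays high across the whole window.
sustained = [4.5, 5, 4.8, 5.2, 4.9]

print(window_breaches(brief_spike, threshold=4))  # False
print(window_breaches(sustained, threshold=4))    # True
```

The brief spike no longer alerts on its own, while sustained load still does. The right window size depends on the metric and on how quickly you need to know about a problem.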
There seems to be a common feeling that if you “just rub some stats on it,” you’ll coax out some major insight. Unfortunately, this isn’t the case. A lot of work in statistics is figuring out which approach will work best against your data without resulting in incorrect answers.
I cannot hope to do proper justice in this book to all the statistical methods you could possibly use—after all, volume upon volume has been written on the topic for centuries. Rather, I intend to teach you some fundamentals, dispel some misconceptions, and leave you in a position to know where to look next. With that, let’s dive in.
Mean, more commonly known as average (and technically known as the arithmetic mean), is useful for determining what a dataset generally looks like without examining every single entry in the set. Calculating the mean is easy: add all the numbers in the dataset together, then divide by the number of entries in the dataset.
A common use of averaging in time series is something called the moving average. Rather than taking the entirety of the dataset and calculating a single average, it recalculates the average over a window of the most recent datapoints as new datapoints arrive. A by-product of this process is that it smooths a spiky graph out. This process is also used in TSDBs for storing rolled-up data and in every time series graphing tool when viewing a large set of metrics.2
For example, if you had a metric with values every minute for the past hour, you would have 60 unique datapoints. As we can see from Figure 4-1, it’s noisy and hard to see what’s going on:
Applying a rolling average with five-minute intervals yields a very different graph. The resulting graph, shown in Figure 4-2, is what we call smoothed:
That is, through the process of averaging values, the peaks and valleys have been lost. There are pros and cons to this: by hiding the extremes in the dataset, we create a dataset with patterns that are easier to spot, but we also lose datapoints that could be valuable. With more smoothing comes a better visualization at the expense of accuracy. In other words, determining the correct amount of smoothing to apply is a balancing act.
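If you’d like to see the smoothing effect in numbers rather than in a graph, here is a small sketch of a moving average. The name `moving_average` and the choice to average over whatever is available until the window fills are my assumptions; your TSDB or graphing tool may treat the first few points differently.

```python
from collections import deque


def moving_average(values, window=5):
    """Yield the mean of the most recent `window` values as each datapoint arrives."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for value in values:
        recent.append(value)
        yield sum(recent) / len(recent)


# A deliberately spiky series: one value per minute, swinging between 10 and 50.
spiky = [10, 50, 10, 50, 10, 50, 10, 50, 10, 50]
smoothed = list(moving_average(spiky, window=5))
print(smoothed[4:])  # [26.0, 34.0, 26.0, 34.0, 26.0, 34.0]
```

Once the window is full, the swings shrink from 10 to 50 down to 26 to 34. A larger window would flatten them further, which is exactly the trade-off between readability and accuracy described above.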
Median is helpful when the average isn’t going to be representative. Essentially, the median is the “middle” of the dataset. In fact, median is often used for analyzing income levels of entire populations for precisely this reason. If you have 10 people, all with incomes of $30,000/yr, the average of their incomes is $30,000, while the median is also $30,000. If one of those 10 people were to strike it rich and have an income of $500,000/yr, the average becomes $77,000, but the median stays the same. In essence, when dealing with datasets that are highly skewed in one direction, the median can often be more representative of the dataset than the mean.
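The income example is easy to check for yourself. This sketch uses Python’s standard `statistics` module; the variable names are just for illustration.

```python
from statistics import mean, median

# Ten people, all earning $30,000/yr.
incomes = [30_000] * 10
mean_before, median_before = mean(incomes), median(incomes)

# One of them strikes it rich.
incomes[-1] = 500_000
mean_after, median_after = mean(incomes), median(incomes)

# The mean jumps from 30,000 to 77,000; the median stays at 30,000.
print(mean_before, mean_after)
print(median_before, median_after)
```

A single outlier moved the mean by $47,000 without touching the median at all.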
To calculate the median, you first must sort the dataset in ascending order, then find the position of the middle entry using the formula (n + 1) / 2, where n is the number of entries in the dataset.3 If your dataset contains an odd number of entries, the median is the exact middle entry. However, if your dataset contains an even number of entries, then the two middle numbers will be averaged, resulting in a median value that may not be a number found in the original dataset.
For example, consider the dataset: 0, 1, 1, 2, 3, 5, 8, 13, 21. The median is 3. If we added a 10th number so the dataset becomes 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, then the median becomes 4.
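Here is a sketch of that calculation written out by hand, mostly to show what the (n + 1) / 2 position means in practice. The name `median_by_position` is mine; in real code you would more likely reach for `statistics.median`.

```python
import math


def median_by_position(values):
    """Median via the (n + 1) / 2 position in the sorted dataset."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2  # 1-based position of the middle
    if position.is_integer():
        # Odd number of entries: the middle entry is the median.
        return ordered[int(position) - 1]
    # Even number of entries: average the two entries around the middle.
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2


print(median_by_position([0, 1, 1, 2, 3, 5, 8, 13, 21]))      # 3
print(median_by_position([0, 1, 1, 2, 3, 5, 8, 13, 21, 34]))  # 4.0
```

With nine entries the position is 5, so the fifth entry (3) is the median. With ten entries the position is 5.5, so the fifth and sixth entries (3 and 5) are averaged.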
Seasonality of data is when your datapoints adopt a repeating pattern. For example, if you were to record your commute time every day for a month, you would notice that it has a certain pattern to it. It may not always be the same time each day, but the pattern holds day-to-day. You use this kind of knowledge every day to help you plan and predict the future: because you know how long your commute normally takes, you know when you need to leave in order to make it to the office on time. Without this seasonality, planning your day would be impossible. Figure 4-3 shows an example of seasonality in web server requests.
If I know, based on previous data, that my web servers handle roughly 100 requests/sec on a given weekday, then I can also assume that half that number or double that number is probably worth investigating. Some tools allow you to apply this on a rolling basis, comparing datapoints now to datapoints at a previous time period, such as comparing req/sec currently to exactly the same time one week prior, one day prior, or even one hour prior. For workloads with a high degree of seasonality, you can thus make assumptions about what the future will look like. Not all workloads are seasonal, though: some have no discernible seasonality at all.
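A comparison against the same time one week prior could be sketched as follows. The function name `deviates_from_last_week` and the half/double bounds are assumptions taken from the example above, not recommended defaults; the two arguments stand in for values you would query from your TSDB.

```python
def deviates_from_last_week(current, week_ago, low=0.5, high=2.0):
    """Flag a value that is at most half, or at least double, last week's value."""
    if week_ago == 0:
        # No traffic a week ago: any traffic now is a change worth a look.
        return current != 0
    ratio = current / week_ago
    return ratio <= low or ratio >= high


# Requests/sec at this time last week was roughly 100.
print(deviates_from_last_week(current=120, week_ago=100))  # False
print(deviates_from_last_week(current=45, week_ago=100))   # True
print(deviates_from_last_week(current=210, week_ago=100))  # True
```

This only makes sense for a metric that actually repeats week over week. On a workload without seasonality, last week’s value is not a meaningful baseline.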
Quantiles are a statistical way of describing a specific point in a dataset. For example, the 0.5 quantile is the midpoint of the data (also known as the median). One of the most common quantiles in operations is the percentile, which is a way of describing a point in the dataset in terms of percentages (from 0 to 100), so the median is also the 50th percentile.
Percentiles are commonly found in metered bandwidth billing and latency reporting, but the calculation is the same for both. First, the dataset is sorted in ascending order, then the top (100 - n) percent of values is removed. The largest remaining value is the nth percentile.4 For example, bandwidth metering is often billed on a 95th percentile basis. To calculate that value, we would drop the top 5% of values. We do this because it’s expected in bandwidth metering that the usage will be bursty, so paying for bandwidth on a 95th percentile basis is fairer. Similarly, using percentiles for latency reporting gives you a good idea of what the majority of the experience is like, ignoring the outliers.
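The rough definition above translates into only a few lines of Python. This is a sketch of the simple “sort, drop the top, take the largest remaining value” approach (often called nearest-rank); the name `percentile` is mine, and your TSDB may well interpolate between values instead, so expect small differences.

```python
import math


def percentile(values, n):
    """nth percentile: the largest value left after dropping the top (100 - n)%."""
    if not values:
        raise ValueError("percentile of an empty dataset is undefined")
    if not 0 < n <= 100:
        raise ValueError("n must be greater than 0 and at most 100")
    ordered = sorted(values)
    keep = math.ceil(len(ordered) * n / 100)  # how many of the lowest values remain
    return ordered[keep - 1]


# Twenty bandwidth samples in Mbps: steady usage with a single burst.
samples = [10] * 19 + [900]
print(percentile(samples, 95))  # 10
print(max(samples))             # 900
```

The single burst falls in the top 5% and is dropped, so the 95th percentile reflects the steady usage rather than the spike.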
By the nature of calculating a percentile, you’re dropping some amount of data. As a result, you can’t average percentiles together: because you’re missing some of the data, the result will be inaccurate. In other words, calculating a daily 95th percentile and then averaging seven of those together does not give you an accurate weekly 95th percentile value. You’ll need to calculate the weekly percentile based on the full set of weekly values.
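A contrived example shows how far off the averaged value can be. The data here is invented purely for illustration: six quiet days where every request takes 100 ms, and one bad day where every request takes 1,000 ms. The `percentile` helper is the same rough nearest-rank sketch, repeated so the example stands alone.

```python
import math
from statistics import mean


def percentile(values, n):
    """nth percentile: the largest value left after dropping the top (100 - n)%."""
    ordered = sorted(values)
    keep = math.ceil(len(ordered) * n / 100)
    return ordered[keep - 1]


# Invented latency data (ms): 100 requests per day for one week.
days = [[100] * 100 for _ in range(6)] + [[1000] * 100]

# Wrong: average the seven daily 95th percentiles.
averaged_daily_p95 = mean(percentile(day, 95) for day in days)

# Right: one 95th percentile over every value from the week.
all_values = [value for day in days for value in day]
weekly_p95 = percentile(all_values, 95)

print(round(averaged_daily_p95, 1))  # 228.6
print(weekly_p95)                    # 1000
```

The averaged figure suggests a week that was a little slow, while the true weekly 95th percentile shows that more than 5% of all requests took a full second.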
While using percentiles will give you an idea of what most of the values are (e.g., in the case of latency, what most users experience), don’t forget that you’re leaving off a good number of datapoints. When using percentiles to judge latency, it can be helpful to calculate the max latency as well, to see what the worst-case scenario is for users.
The standard deviation is a method of describing how close or far values are from the mean. That sounds great at first, but there’s a catch: while you can calculate it for any dataset, only a normally distributed dataset is going to yield the result you expect. Using standard deviation in a dataset that’s not normally distributed may result in unexpected answers.
One handy bit about standard deviation is that the amount of data within specific deviations is predictable. As you can see from Figure 4-4, 68% of the data resides within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. Keep in mind that this holds true only for normally distributed datasets.
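You can see both the rule and its limits with some generated data. In this sketch, the normally distributed sample and the skewed sample (drawn from an exponential distribution as a stand-in for something latency-shaped) are my own choices for illustration, as is the helper name `share_within`.

```python
import random
from statistics import fmean, pstdev


def share_within(values, k):
    """Fraction of values that fall within k standard deviations of the mean."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    inside = sum(1 for v in values if abs(v - mu) <= k * sigma)
    return inside / len(values)


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
normal_sample = [rng.gauss(100, 15) for _ in range(20_000)]
skewed_sample = [rng.expovariate(1 / 100) for _ in range(20_000)]

for k in (1, 2, 3):
    normal_share = round(share_within(normal_sample, k), 3)
    skewed_share = round(share_within(skewed_sample, k), 3)
    print(k, normal_share, skewed_share)
```

The normal sample lands close to 68%, 95%, and 99.7%. The skewed sample does not: roughly 86% of its values sit within one standard deviation of the mean, so an alert built on the assumption of 68% would behave very differently from what you intended.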
I mention standard deviation only because there’s bad news: most of the data you’ll be working with doesn’t fit a model where standard deviation will work well. You’re better off skipping right past using standard deviation rather than wasting time wondering why the calculation’s results aren’t what you were expecting.
This section has only barely scratched the surface when it comes to the world of statistics, but I’ve tried to focus on the most common and highest-impact approaches for operations and engineering work. To recap:
Average is the most common and useful function you’ll use, as it’s widely applicable to lots of different datasets. Median is also quite handy for some datasets.
Seasonality is just a fancy way of talking about patterns in data based on time. Look at your traffic log and I bet you’ll see seasonality.
Percentiles are helpful for understanding what the bulk of your data looks like, but be careful: they inherently ignore the extreme datapoints.
Standard deviation is a useful tool, but not so much for the sort of data you’ll be dealing with.
I’ll leave you with a few questions to consider when thinking about your data.
Does it have a large skew in either direction? That is, do the datapoints cluster at either end of a graph?
Are extreme outliers common?
Are there upper and lower bounds for datapoints? For example, latency measurements can, in theory, be effectively infinite in the positive direction (bounded on the low end by zero), while CPU utilization percentage is bounded on both ends (0% and 100%).
By asking these questions of your data, you’ll start to understand which statistical approaches may work well and which may not.
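A quick way to start answering those questions is to look at a handful of summary values side by side. This is only a sketch: the `profile` helper and the sample latencies are invented for illustration, and comparing mean to median is a crude hint about skew rather than a formal test.

```python
from statistics import mean, median


def profile(values):
    """Summarize a dataset: its bounds plus two views of its center."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": mean(ordered),
        "median": median(ordered),
    }


# Invented request latencies in ms, with one extreme outlier.
latencies_ms = [12, 14, 15, 15, 16, 18, 20, 22, 25, 900]
summary = profile(latencies_ms)
print(summary)
```

Here the mean (105.7) is far above the median (17), and the max (900) is far from both, which points to a skewed dataset with at least one extreme outlier.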
And with that, we’ve reached the end of Part I of Practical Monitoring! In Part II we’ll get into the nitty-gritty of “What should I be monitoring? How do I do it?”
1 I don’t mean to pick on Nagios—it’s just that Nagios, thanks to its influence, has set the expected standards in many tools. There are plenty of other tools just as guilty.
2 Loading several thousand datapoints from disk to display a graph takes a very long time, and you probably don’t care about granularity when viewing four weeks’ worth of data.
3 Your TSDB hides the underlying calculation from you, but trust me, this is what it’s doing.
4 This is a rough definition and glosses over some subtleties of the underlying math. A more thorough treatment of percentiles can be found in “Statistics for Engineers” (Heinrich Hartmann, Communications of the ACM, Vol. 59, No. 7).