As we saw in the previous chapter, it is relatively easy to build the very simplest anomaly detector, one that looks for deviations from an ideal value. Tools like the t-digest can help by analyzing historical data to accurately find a good threshold. Statistically, such a system is building a model of the input data that describes the data as a constant value with some additive noise. For most of the systems we have seen so far, the model is nearly trivial, but it is a model nonetheless.
But what about the more complicated situations, such as the one shown at the end of the last chapter in Figure 3-3? Systems that are not stationary or that have complicated patterns even when they are roughly periodic require something more than a simple threshold detector. And what happens when conditions change?
What is needed is an adaptive machine-learning model for anomaly detection. In Chapter 2 we discussed the idea of a probabilistic model that is trained using histories of past events to estimate their likelihood of occurrence as a way to describe what is normal. This type of model is adaptive: as small fluctuations occur in the majority of events, our model can adjust its view of “normal” accordingly. In other words, it adapts to reasonable variations. Back to our bird-watching analogy: if our bird detector is looking for unusual species (“accidentals” in bird-watching jargon) or significant and possibly catastrophic changes in the population of normal species, we might want our model to be able to adjust to small changes in the local population in response to slight shifts in weather conditions, availability of seeds and water, or excessive activity of the neighbor’s cat. We want our model to be adaptive.
Conceptually, it is relatively easy to imagine extending the constant threshold model of Chapter 3 by allowing the mean value to vary and then setting all thresholds relative to that mean value. Statistically speaking, what we are doing is describing our input as a time-varying base value combined with additive noise that has a constant distribution and zero mean. By building a model of exactly how we expect that base value to vary, we can build an anomaly detector that produces an alert whenever the input behaves sufficiently differently from what is expected.
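The varying-mean idea can be sketched in a few lines. This is a minimal illustration, not the system described in the text: the rolling window here is a crude stand-in for a real model of the base value, and the window length and threshold multiplier are made-up values chosen only for the example.

```python
from collections import deque

def varying_mean_detector(samples, window=50, threshold=3.0):
    """Flag samples that deviate too far from a rolling estimate of the
    time-varying base value (window and threshold are illustrative)."""
    history = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(samples):
        if len(history) == window:
            base = sum(history) / window                    # time-varying base value
            var = sum((h - base) ** 2 for h in history) / window
            sigma = var ** 0.5 or 1.0                       # guard against zero spread
            if abs(x - base) > threshold * sigma:           # residual too large?
                anomalies.append(i)
        history.append(x)
    return anomalies
```

A slowly drifting input stays within the moving threshold, while a sudden spike stands out against it, which is exactly the behavior the constant-threshold detector of Chapter 3 cannot provide on its own.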
Let’s take a look at the EKG signal that we talked about in Chapter 2 of this report. The pulses in the EKG that record the heartbeats of the patient being measured are each highly similar to one another. In fact, substantial changes in the shape of the waveforms in the EKG often indicate either some sort of physiological problem or equipment malfunction. So the problem of anomaly detection in an EKG can be viewed as the problem of how to build a model of what heartbeats should look like, and then how to compare the observed heartbeats to this ideal model. If the model is a good one—in other words, a good estimate of normal heart behavior—then a measurement of error for observed behavior relative to this ideal model will highlight anomalous behavior. If the magnitude of the error stands out, it may be a flag that highlights irregular heartbeats or a failure in the monitor. This approach can work to find anomalies in a variety of situations, not just for an EKG.
To understand this mathematically (just stay with us), think back to Chapter 2, in which we stated that anomaly detection involves modeling what is normal, looking for events that lie far from normal, and deciding how far is too far. In the type of example described in this chapter, we have a continuous measurement, and at any point in time, we have a single value. The probabilistic model for “normal” in this situation is a sum of a base value plus random noise. We can estimate the base value using our model, and subtracting this estimate from our observed input data leaves the random noise, which is our reconstruction error. According to our model, the error noise has an average value of zero and a stationary distribution. When the remaining noise is large, we have an anomaly, and the t-digest can decide what we should consider a large value of noise to be.
We still have the challenge of finding an approachable, practical way to model normal for a very complicated curve such as the EKG. To do this, we are going to turn to a type of machine learning known as deep learning, at least in an introductory way. Here’s how.
Deep learning involves letting a system learn in several layers, in order to deal with large and complicated problems in approachable steps. With a nod toward this approach, we’ve found a simple way to do this for curves such as the EKG that have repeated components separated in time rather than superposed. We take advantage of the repetitive and separated nature of an EKG curve in order to accurately model its complicated shape.
Figure 4-1 shows an expanded view of two heartbeats from an EKG signal. Each heartbeat consists of several phases or pulses that correspond to electrical activity in the heart. The first part of the heartbeat is the P wave, followed by the QRS complex, or group of pulses, and then the T wave. The recording shown here was made with a portable recording device and doesn’t show all of the detail in the QRS complex, nor is the U wave visible just after the T wave. The thing to notice is that each wave is strikingly similar from heartbeat to heartbeat. That heart is beating in a normal pattern.
To build a model that uses this similarity, we use a mathematical trick called windowing. It’s a way of dealing with regular but complex patterns when you need to build a model that can accurately predict them. This method involves extracting short sequences of the original signal in such a way that the short sequences can be added back together to re-create the original signal.
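The key property of windowing—that the extracted pieces add back up to the original signal—can be demonstrated with a short sketch. A raised-cosine (Hann) taper at 50% overlap has exactly this property, because overlapping taper values sum to one; the window length here is an arbitrary choice for illustration, not the one used for the EKG analysis.

```python
import math

def make_windows(signal, length=32):
    """Slice the signal into half-overlapping windows, each shaped by a
    raised-cosine (Hann) taper. At 50% overlap the tapers sum to one,
    so the pieces can be added back together to re-create the signal."""
    step = length // 2
    taper = [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]
    pieces = []
    for start in range(0, len(signal) - length + 1, step):
        chunk = signal[start:start + length]
        pieces.append((start, [c * t for c, t in zip(chunk, taper)]))
    return pieces

def overlap_add(pieces, total_len):
    """Re-create the signal by summing the windowed pieces in place."""
    out = [0.0] * total_len
    for start, piece in pieces:
        for i, v in enumerate(piece):
            out[start + i] += v
    return out
```

Away from the first and last half-window, `overlap_add(make_windows(sig), len(sig))` reproduces `sig` exactly (up to floating-point rounding), which is what lets the short sequences stand in for the original signal.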
The first step of our analysis is to do windowing to break up the larger pattern into small components. Figure 4-2 shows how the EKG for these two heartbeats is broken up into a sequence of nine overlapping short signals. As you can see, several of these signals are similar to each other. This similarity can be exploited to build a heartbeat model by aligning and clustering all of the short signals observed in a long recording. In this example, the clustering was done using a ball k-means algorithm from the Apache Mahout library. We chose this algorithm because our sample data here is not huge, so we could do this as an in-memory operation. With larger data sets, we might have done the clustering with a different algorithm, such as streaming k-means, also from Mahout.
The clustering operation essentially builds a catalog of these shapes so that the original signal can be encoded by simply recording which shape from the catalog is used in each time window, along with a scaling factor for each shape.
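The encoding step can be sketched as follows. This sketch assumes the catalog of shapes (the cluster centroids) already exists; in the text it is produced by ball k-means in Apache Mahout, which is not reproduced here. Each window is stored as just two numbers: the index of its best-matching shape and a least-squares scaling factor.

```python
def encode_window(window, catalog):
    """Return (shape index, scale) for the catalog shape that best
    matches the window in a least-squares sense."""
    best = None
    for idx, shape in enumerate(catalog):
        norm = sum(s * s for s in shape)
        if norm == 0:
            continue  # skip degenerate all-zero shapes
        # Least-squares scale of this shape onto the window.
        scale = sum(w * s for w, s in zip(window, shape)) / norm
        err = sum((w - scale * s) ** 2 for w, s in zip(window, shape))
        if best is None or err < best[0]:
            best = (err, idx, scale)
    return best[1], best[2]

def decode_window(code, catalog):
    """Rebuild a window from its (shape index, scale) code."""
    idx, scale = code
    return [scale * s for s in catalog[idx]]
```

Decoding every window and overlap-adding the results gives the reconstructed signal; the difference between that reconstruction and the input is the reconstruction error discussed next.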
By looking at a long time series of data from an EKG, you can construct a dictionary of component shapes that are typical for normal heart behavior. Figure 4-3 shows a selection of 64 out of 400 shapes from a dictionary constructed based on several hours of a real EKG recording. These shapes clearly show some of the distinctive patterns found in EKG recordings.
Now let’s see what happens when this technique is applied to a new EKG signal. Remember, the goal here is to build a model of an observed signal, compare it to the ideal model, and note the level of error between the two. We assume that the shapes used as components (the dictionary of shapes shown in Figure 4-3) are accurate depictions of a normal signal. Given that, a low level of error in the comparison between the reconstructed signal and the ideal suggests that the observed signal is close to normal. In contrast, a large error points to a mismatch. This is not a test of the reconstruction method but rather a test of the observed signal. A mismatch indicates an abnormal signal, and thus an anomaly in heart function.
Figure 4-4 shows a reconstruction of an EKG signal. The top trace is the original signal, the middle one is the reconstructed signal, and the bottom trace shows the difference between the first two. This last signal shows how well the reconstruction encodes the original signal and is called the reconstruction error.
As long as the original signal looks very much like the signals that were used to create the shape dictionary, the reconstruction will be very good, and the reconstruction error will be small. The dictionary is thus a model of what EKG signals can look like, and the reconstruction error represents the degree to which the signal being reconstructed looks like a heartbeat. A large reconstruction error occurs when the input does not look much like a heartbeat (or, in other words, is anomalous).
Note in Figure 4-4 how the reconstruction error has a fixed baseline. This fact suggests that we can apply a thresholding anomaly detector as described in Chapter 3 to the reconstruction error to get a complete anomaly detector. Figure 4-5 shows a block diagram of an anomaly detector built based on this idea.
Essentially what is happening here is that the encoder can only reproduce very specific kinds of signals. The encoder is used to reduce the complex input signal to a reconstruction error, which is large whenever the input isn’t the kind of signal the encoder can handle. This reconstruction error is just the sort of stationary signal that is appropriate for use with the methods described in Chapter 3, and so we can use the t-digest on the reconstruction error to look for anomalies.
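The final thresholding stage can be sketched as below. A sorted-list quantile stands in for the t-digest here purely for illustration (a real t-digest handles streaming data without keeping every value), and the 99th-percentile cutoff is an assumed setting, not one from the text.

```python
def detect(errors, quantile=0.99):
    """Return indices of windows whose reconstruction error exceeds the
    given quantile of all observed errors. A plain sorted-list quantile
    stands in for the t-digest of Chapter 3."""
    cutoff = sorted(errors)[int(quantile * (len(errors) - 1))]
    return [i for i, e in enumerate(errors) if e > cutoff]
```

Because the reconstruction error is a roughly stationary signal with a fixed baseline, a quantile-based threshold like this is all that is needed to turn it into alerts, just as in Chapter 3.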
That this approach can find interesting anomalies is shown in Figure 4-6, where you can see a spike in the reconstruction error just after 101 seconds. It is clear that the input signal (top trace) is not faithfully rendered by the reconstruction, but at this scale, it is hard to see exactly why.
If you expand the time scale, however, it is easy to see what is happening. Figure 4-7 shows that at about 101.25 seconds, the QRS complex in the heartbeat is actually a double pulse. This double pulse is very different from anything that appears in a recording of a normal heartbeat, and that means that there isn’t a shape in the dictionary that can be used to reconstruct this signal.
This kind of anomaly detector can’t say what the anomaly is. All it can do is tell us that something unusual has happened. Expert human judgment is required to interpret the physiological meaning of this anomaly, but having a system that can draw attention to where human judgment should best be applied helps, if only by avoiding fatigue.
Other systems can make use of this general model-based signal reconstruction technique. The signal doesn’t have to be as perfectly periodic for this to work, and it can involve multidimensional inputs instead of just a single signal. The key is that there is a model that can encode the input very concisely. You can find the code and data used in this example on GitHub.
Note that the model used here to encode the EKG signal is fairly simple. The technique used to produce this model gives surprisingly good results given its low level of complexity. It can be used to analyze signals in a variety of situations including sound, vibration, or flow, such as what might be encountered in manufacturing or other industrial settings. Not all signals can be accurately analyzed with a model as simple as the one used here, particularly if the fundamental patterns are superposed as opposed to separated, as with the EKG. Such superposed signals may require a more elaborate model. Even the EKG signal requires a fancier model if we want to capture not only the shape of the individual heartbeats but also features such as irregular heartbeats. One approach for building a more elaborate model is to use deep learning to build a reconstruction model that understands a waveform on both short and long time scales. The model shown here can be extended to a form of deep learning by recursively clustering the groups of cluster weights that it produces.
Regardless of the details of the model itself, however, the architecture shown here will still work for inputs roughly like this one, or even for inputs that consist of many related measurements.
The EKG model we discussed was an example of a system with a continuous signal having a single value at any point in time. A reconstruction-model approach can also be used for a different type of system, one with multiple measurements being made at a single point in time. An example of a multidimensional system like this is a community water supply system. With the increasing use of sensors in such systems that report many different system parameters, you might encounter thousands of measurements for a single point in time. These measurements could include flow or pressure at multiple locations in the system, or depth measurements for reservoirs. Although this type of system has a huge number of data points at any single time stamp, it does not exhibit the complex dynamics of the heartbeat/EKG example. To make a probabilistic model of such a multidimensional water system, you might use a fairly sophisticated approach such as neural nets or a physics-based model, but for anomaly detection, you can still use a reconstruction error–based technique.
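For such a multidimensional system, the anomaly score at each time stamp is just the size of the error vector between the observed readings and the model's reconstruction of them. This minimal sketch assumes some reconstruction model already exists; only the error measurement is shown.

```python
def reconstruction_error(observed, reconstructed):
    """Euclidean distance between the observed sensor readings and the
    model's reconstruction of them at one time stamp."""
    return sum((o - r) ** 2 for o, r in zip(observed, reconstructed)) ** 0.5
```

The stream of these error values is one-dimensional and roughly stationary when the system is behaving normally, so the same t-digest thresholding applies regardless of how many sensors feed into it.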
Where this approach really begins to break down a bit is with inputs that are not based on a measurement of a value such as voltage, pressure, or flow. A good example of such a problematic input can be found in the log files from a website. In that case, the input consists of events that have a time of occurrence and some kind of label. Even though the methods described in this chapter can’t be applied verbatim, the basic idea of a probabilistic model is still the key to good anomaly detection. The use of an adaptive, probabilistic model for e-commerce log files is the topic of the next chapter.