This example is based on the KDD Cup 99 dataset, which collects a long series of normal and malicious internet activities. In particular, we are going to focus on the subset of HTTP requests, which has four attributes: duration, source bytes, destination bytes, and behavior (the last is really a classification label, but it gives us immediate access to some specific attacks). As the original values were very small numbers around zero, all versions (including the scikit-learn one) renormalize the variables using the formula log(x + 0.1); hence, the same transformation must be applied to new samples when simulating the anomaly detection. Of course, the inverse transformation is as follows:

x = exp(z) - 0.1, where z = log(x + 0.1)
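The renormalization and its inverse can be sketched in a few lines of NumPy. This is only an illustration: the helper names `to_log_scale` and `from_log_scale` and the values in `raw_sample` are hypothetical and are not part of the dataset or of scikit-learn.

```python
import numpy as np


def to_log_scale(x):
    """Apply the log(x + 0.1) renormalization to raw feature values."""
    return np.log(np.asarray(x, dtype=float) + 0.1)


def from_log_scale(z):
    """Invert the renormalization: x = exp(z) - 0.1."""
    return np.exp(np.asarray(z, dtype=float)) - 0.1


# Hypothetical raw sample: duration, source bytes, destination bytes
raw_sample = np.array([0.0, 181.0, 5450.0])

# Forward transformation, to be applied to any new sample before scoring it
scaled_sample = to_log_scale(raw_sample)

# The inverse transformation recovers the raw values (up to floating-point error)
recovered_sample = from_log_scale(scaled_sample)
```

A new raw sample has to go through `to_log_scale` before it is compared with the renormalized data, while `from_log_scale` is useful for reading results back in the original units.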
Let's start ...