Anomaly detection with the KDD Cup 99 dataset

This example is based on the KDD Cup 99 dataset, which collects a long series of normal and malicious internet activities. In particular, we are going to focus on the subset of HTTP requests, which has four attributes: duration, source bytes, destination bytes, and behavior (which is really a classification label, but it's helpful for us to have immediate access to some specific attacks). As the original values were very small numbers around zero, all versions (including the scikit-learn one) renormalize the variables using the formula log(x + 0.1); hence, the same transformation must be applied to new samples when simulating anomaly detection. Of course, if z = log(x + 0.1) is the transformed value, the inverse transformation is as follows:

x = e^z - 0.1
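
The transformation and its inverse can be sketched in a few lines of Python. This is only a minimal illustration, not the book's code: the function names `to_log_scale` and `from_log_scale`, the constant `OFFSET`, and the sample values are hypothetical, and the logarithm is assumed to be the natural one.

```python
import math

OFFSET = 0.1  # constant added before taking the logarithm (from the formula above)


def to_log_scale(sample):
    """Apply z = log(x + 0.1) to each raw attribute of a sample."""
    return tuple(math.log(x + OFFSET) for x in sample)


def from_log_scale(sample):
    """Invert the transformation: x = exp(z) - 0.1."""
    return tuple(math.exp(z) - OFFSET for z in sample)


# Hypothetical raw sample: (duration, source bytes, destination bytes)
raw_sample = (0.0, 181.0, 5450.0)

# A new sample must be rescaled before it is compared with the dataset
scaled_sample = to_log_scale(raw_sample)

# Going back to the original units, up to floating-point rounding
recovered_sample = from_log_scale(scaled_sample)

print(scaled_sample)
print(recovered_sample)
```

Note that a raw value of 0 maps to log(0.1), which is about -2.30, so the offset keeps the logarithm defined for zero-valued attributes.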

Let's start ...
