November 2019
Intermediate to advanced
346 pages
9h 36m
English
We start by reading the KDD Cup dataset into a data frame. Next, in Step 2, we examine the data and see that, as expected, most of the traffic is normal and only a small fraction is abnormal. The classes are therefore highly imbalanced, which makes this problem a promising candidate for an anomaly detection approach. In Steps 3 and 5, we collapse all non-normal traffic into a single class, namely, anomalous.
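The relabeling described above can be sketched on a toy data frame. This is only an illustration under stated assumptions: the column names `duration` and `label`, the label strings, and the six rows are hypothetical stand-ins, not the recipe's actual data or code.

```python
import pandas as pd

# Hypothetical toy stand-in for the KDD Cup data frame (not real data).
df = pd.DataFrame(
    {
        "duration": [0, 0, 2, 0, 1, 0],
        "label": ["normal", "smurf", "normal", "neptune", "normal", "normal"],
    }
)

# Examine the class balance: most rows are normal, a few are not.
print(df["label"].value_counts())

# Collapse every non-normal label into a single "anomalous" class.
# Series.where keeps values where the condition holds and replaces the rest.
df["label"] = df["label"].where(df["label"] == "normal", "anomalous")

n_normal = int((df["label"] == "normal").sum())
n_anomalous = int((df["label"] == "anomalous").sum())
print(n_normal, n_anomalous)
```

After this step the label column holds only two values, which is what a binary anomaly detection setup needs.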
We also compute the ratio of anomalies to normal observations (Step 4), known as the contamination parameter. It is one of the parameters that control the sensitivity of isolation forest. Setting it is optional, but is likely to improve performance. We split our dataset into normal and anomalous ...
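As a rough sketch of how a contamination value feeds into an isolation forest, the example below uses scikit-learn's `IsolationForest` on synthetic points. The data, the variable names, and the choice to fit on the combined set are assumptions made for illustration, not the recipe's own steps. Note that scikit-learn documents `contamination` as the proportion of outliers in the data set; when anomalies are rare, this is close to the ratio of anomalies to normal observations described above, so the sketch computes both and passes the proportion.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (assumption): 950 normal points, 50 anomalies.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(950, 2))
X_anomaly = rng.uniform(low=6.0, high=9.0, size=(50, 2))
X = np.vstack([X_normal, X_anomaly])

# Ratio of anomalies to normal observations, as described in the text.
ratio = len(X_anomaly) / len(X_normal)
# Proportion of anomalies in the whole data set (scikit-learn's definition).
proportion = len(X_anomaly) / len(X)

# contamination sets the score threshold used by predict().
clf = IsolationForest(contamination=proportion, random_state=0)
clf.fit(X)

# predict() returns +1 for inliers and -1 for outliers.
preds = clf.predict(X)
print(ratio, proportion, int((preds == -1).sum()))
```

Leaving `contamination` at its default of `"auto"` also works; supplying a measured value simply ties the decision threshold to the anomaly rate observed in the data.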