An imbalanced and multiclass classification problem

Given some descriptors of a sequence of packets, flowing to/from a host connected to the Internet, the goal of this problem is to detect whether that sequence signals a malicious attack or not. If it does, we should also classify the type of attack. That's a multiclass classification problem, since the possible labels are multiple ones.

For each observation, 42 features are revealed: please note that some of them are categorical, whereas others are numerical. The dataset is composed of almost 5 million observations (but in this exercise we're using just the first million, to avoid memory constraints), and the number of possible labels is 23. One of them represents a non-malicious situation (

Get Regression Analysis with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.