An imbalanced and multiclass classification problem

Given some descriptors of a sequence of packets, flowing to/from a host connected to the Internet, the goal of this problem is to detect whether that sequence signals a malicious attack or not. If it does, we should also classify the type of attack. That's a multiclass classification problem, since the possible labels are multiple ones.

For each observation, 42 features are revealed: please note that some of them are categorical, whereas others are numerical. The dataset is composed of almost 5 million observations (but in this exercise we're using just the first million, to avoid memory constraints), and the number of possible labels is 23. One of them represents a non-malicious situation (

Get Regression Analysis with Python now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.