Reading the dataset

First, let's download and decompress the dataset. To be conservative, we will use just 10% of the original training dataset (75 MB uncompressed), since all our analysis runs on a small virtual machine. If you want to try the full training dataset (750 MB uncompressed), uncomment the corresponding lines in the following snippet. We download the training dataset, the test dataset (47 MB), and the feature names using bash commands:

In: !mkdir datasets
    !rm -rf ./datasets/kdd*
    # !wget -q -O datasets/kddtrain.gz \
    # http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz
    !wget -q -O datasets/kddtrain.gz \
    http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz
    !wget -q -O datasets/kddtest.gz ...
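If wget is not available on your machine, the same download-and-decompress step can be sketched in plain Python using only the standard library. This is a minimal sketch, not the method used in the listing above: the helper names fetch and gunzip are hypothetical, and so is the output filename datasets/kddtrain.txt shown in the usage comment.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def fetch(url, dest):
    """Download url to dest, creating the parent directory if needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def gunzip(src, dst):
    """Decompress the gzip file src into dst, streaming in chunks."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)


# Example usage (performs a network download; the output name is an assumption):
# gz = fetch(
#     "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz",
#     "datasets/kddtrain.gz",
# )
# gunzip(gz, "datasets/kddtrain.txt")
```

Streaming the decompression with shutil.copyfileobj keeps memory usage low, which matters on a small virtual machine if you switch to the full 750 MB training file.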
