Let's do some analytics with k-means clustering

Anomalous data is data that deviates from the normal distribution of the rest of the dataset. Detecting anomalies is therefore an important task in network security, where anomalous packets or requests can be flagged as errors or potential attacks.

In this example, we will use the KDD-99 dataset (it can be downloaded from http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). First, a number of columns will be filtered out based on certain criteria of the data points, which keeps the example easier to follow. Second, because this is an unsupervised task, we will have to remove the labels from the data. Let's load and parse the dataset as plain text, and then see how many rows it contains:

INPUT = "C:/Users/rezkar/Downloads/kddcup.data" ...
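The parsing step described above can be sketched in plain Scala, without any Spark dependency, so that the logic is easy to inspect on its own. This is only an illustration under stated assumptions: the object name KddParsing, the case class Parsed, and the helper parseLine are hypothetical names, and the choice to drop columns 1 to 3 (the categorical protocol_type, service, and flag fields) is an assumption about which columns get filtered out. The sample record is a made-up, shortened line, not a real row from the dataset.

```scala
// A minimal, Spark-free sketch of parsing one comma-separated KDD-99 record.
// All names here are hypothetical and chosen for illustration only.
object KddParsing {
  // Assumption: columns 1-3 (protocol_type, service, flag) are categorical
  // and are the ones filtered out; the last column holds the label.
  val CategoricalStart = 1
  val CategoricalCount = 3

  // The label is kept separately so it can be ignored during unsupervised training.
  final case class Parsed(label: String, features: Vector[Double])

  def parseLine(line: String): Parsed = {
    val fields = line.split(',').map(_.trim).toVector
    val label = fields.last
    // Drop the label, then cut out the categorical columns.
    val numeric = fields.init.patch(CategoricalStart, Nil, CategoricalCount)
    Parsed(label, numeric.map(_.toDouble))
  }

  // Counting rows is just counting lines once the file is read as text.
  def countRows(lines: Seq[String]): Int = lines.size
}

// Usage on a made-up, shortened record (real records have 41 features plus a label).
val sample = Seq("0,tcp,http,SF,215,45076,normal.")
val parsed = sample.map(KddParsing.parseLine)
println(s"rows = ${KddParsing.countRows(sample)}, first features = ${parsed.head.features}")
```

With Spark, the same function could presumably be mapped over the RDD returned by `sparkContext.textFile(INPUT)`, with `count()` giving the number of rows. Keeping the parser as a plain function makes it testable without a cluster.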
