Chapter 5. Anomaly Detection with K-means Clustering

Classification and regression are powerful, well-studied techniques in machine learning. Chapter 4 demonstrated using a classifier as a predictor of unknown values. But there was a catch: to predict unknown values for new data, we had to know the target values for many previously seen examples. Classifiers can help only if we, the data scientists, know what we are looking for and can provide plenty of examples where input produced a known output. These were collectively known as supervised learning techniques, because their learning process receives the correct output value for each example in the input.

However, sometimes the correct output is unknown for some or all examples. Consider the problem of dividing up an ecommerce site’s customers by their shopping habits and tastes. The input features are their purchases, clicks, demographic information, and more. The output should be groupings of customers: perhaps one group will represent fashion-conscious buyers, another will turn out to correspond to price-sensitive bargain hunters, and so on.

If you were asked to determine this target label for each new customer, you would quickly run into a problem in applying a supervised learning technique like a classifier: you don’t know a priori who should be considered fashion-conscious, for example. In fact, you’re not even sure if “fashion-conscious” is a meaningful grouping of the site’s customers to begin with!

Fortunately, unsupervised ...

Get Advanced Analytics with PySpark now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.