Data Science Essentials in Python

Unit 50

Grouping Data with K-Means Clustering

Clustering is an unsupervised machine learning technique: it works on unlabeled data, so you do not need to (and cannot!) train the model on known answers.

The goal of clustering is to collect samples (represented as n-dimensional vectors of real numbers) into disjoint, compact groups with good internal proximity. For clustering to work, the vector dimensions must have reasonably compatible ranges. If the range of one dimension is much wider or much narrower than the ranges of the other dimensions, you should scale the variables that are “too tall” or “too short” before clustering.
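As an illustration of that scaling step, here is a minimal sketch that rescales each dimension to the range [0, 1] (min-max scaling). This is only one of several reasonable scaling choices. The helper name `min_max_scale` and the sample data (age and income) are hypothetical, made up for this example.

```python
def min_max_scale(column):
    """Rescale a sequence of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has zero range; map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical samples: (age in years, annual income in dollars).
# Income spans tens of thousands while age spans tens, so income
# would dominate any distance computed on the raw values.
samples = [(25, 40_000), (32, 85_000), (47, 120_000), (51, 62_000)]

# Scale each dimension (column) independently, then rebuild the vectors.
columns = list(zip(*samples))
scaled_columns = [min_max_scale(col) for col in columns]
scaled_samples = list(zip(*scaled_columns))
```

After scaling, both dimensions span the same range, so neither one dominates the distances between samples.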

K-means clustering aggregates samples into k clusters (hence the name) according to the following algorithm:

Randomly choose k vectors as the initial centroids ...
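For orientation, here is a minimal sketch of the commonly described k-means iteration (often called Lloyd's algorithm), written in plain Python. It starts from randomly chosen centroids as in the first step above; the remaining steps follow the usual textbook formulation and are an assumption here, not necessarily the exact steps the book lists. The function name `kmeans`, its parameters, and the sample points are illustrative.

```python
import math
import random

def kmeans(samples, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups.

    Returns (labels, centroids). Illustrative sketch, not a tuned implementation.
    """
    rng = random.Random(seed)
    # Randomly choose k of the samples as the initial centroids.
    centroids = rng.sample(samples, k)
    labels = None
    for _ in range(max_iter):
        # Assign every sample to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(s, centroids[j]))
            for s in samples
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the samples assigned to it.
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Which group receives label 0 and which receives label 1 depends on the random initial choice, so only the grouping itself is meaningful. In practice a library implementation such as scikit-learn's `KMeans` is usually a better choice than hand-written code.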


Data Science Essentials in Python by Dmitry Zinoviev