Finding an optimal number of clusters for k-means

Often, you will not know how many clusters to expect in your data. For two- or three-dimensional data, you could plot the dataset and try to eyeball the clusters. This becomes harder as the number of dimensions grows: beyond three dimensions, the data can no longer be plotted directly on a single chart.

In this recipe, we will show you how to find the optimal number of clusters for a k-means clustering model. We will use the Davies-Bouldin metric to assess the performance of our k-means models as we vary the number of clusters; lower values of the metric indicate more compact, better-separated clusters. The aim is to stop when a minimum of the metric is found.
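As a rough illustration of the idea, and not necessarily the implementation this recipe goes on to use, the search could be sketched with scikit-learn's `KMeans` and `davies_bouldin_score`. The helper name `find_best_k`, the synthetic three-blob dataset, and the candidate range of 2 to 6 clusters are all assumptions made for this example. For simplicity, the sketch scores every candidate and picks the lowest value instead of stopping at the first minimum.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def find_best_k(X, k_values):
    """Fit k-means for each candidate k and return (best_k, scores_by_k).

    Hypothetical helper for illustration only.
    """
    scores = {}
    for k in k_values:
        # n_init and random_state are illustrative choices for repeatability
        model = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = model.fit_predict(X)
        # Lower Davies-Bouldin values mean more compact, better-separated clusters
        scores[k] = davies_bouldin_score(X, labels)
    # The candidate with the smallest score is taken as the best k
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data (an assumption): three well-separated 2D blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 9]],
    cluster_std=0.5,
    random_state=0,
)

best_k, scores = find_best_k(X, range(2, 7))
for k, score in scores.items():
    print(f"k={k}: Davies-Bouldin = {score:.3f}")
print("best k:", best_k)
```

The Davies-Bouldin index is only defined for two or more clusters, so the candidate range starts at 2. On real data the minimum may be less pronounced than on this synthetic set, so it is worth inspecting the whole list of scores and not just the winner.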

Getting ready

To execute this recipe, you will need pandas, NumPy, and scikit-learn. No other prerequisites ...
