Clustering

What are the typical types of people in our data? Clustering is a powerful statistical method for finding this sort of pattern. A clustering algorithm splits data points into several characteristic classes by grouping together similar instances. There are many methods for clustering, but one of the simplest and most popular is called k-means. In k-means, each cluster has a center point, a "centroid." Several different centroids are found in the data, and each data point is assigned to its nearest centroid. The algorithm iteratively adjusts the centroids and the assignments so that data points end up as close as possible to their assigned centroids.
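To make the iteration concrete, here is a minimal sketch of the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is an illustration only, not the code used for the analysis in this chapter: the synthetic two-dimensional data, the variable names, and the cap of 20 iterations are all assumptions made for the example.

```r
# Minimal sketch of the k-means iteration on synthetic data (illustrative only).
set.seed(1)

# Hypothetical data: two blobs of 50 points each in 2 dimensions
pts = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 5), ncol = 2))
k = 2

# Start from k randomly chosen data points as the initial centroids
centroids = pts[sample(nrow(pts), k), , drop = FALSE]

for (iter in 1:20) {
  # Assignment step: squared Euclidean distance from every point to every centroid
  d2 = sapply(1:k, function(j) rowSums(sweep(pts, 2, centroids[j, ])^2))
  assignment = max.col(-d2, ties.method = "first")  # index of the nearest centroid

  # Update step: move each centroid to the mean of the points assigned to it
  new_centroids = centroids
  for (j in 1:k) {
    members = pts[assignment == j, , drop = FALSE]
    if (nrow(members) > 0) new_centroids[j, ] = colMeans(members)
  }

  # Stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-9)) break
  centroids = new_centroids
}

print(centroids)
print(table(assignment))
```

In practice, R's built-in kmeans function does this work, with better handling of starting points and edge cases.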

In our data set, each face has about 20 numeric attributes. Thus, faces are points in a 20-dimensional space. K-means will place faces into several different clusters within that space, trying to select clusters where faces are as similar to their cluster's center as possible.

One unfortunate aspect of k-means clustering is that you have to pick a fixed number of clusters, "k", up front, and there isn't an obvious way to choose that number. The best thing to do is to try a few different values and see what patterns emerge. Here's one run of k-means we did that gave reasonable output:

Preprocess the data,                   > norm_data = apply(d, 2, function(x) {
by changing missing values to the mean,       x[is.na(x)] = mean(x, na.rm=TRUE)
and unit-normalizing values,                  x = (x - mean(x)) / sd(x)
which usually makes k-means work better.      x })
Then run k-means for ...
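The advice above about trying a few different values of k can be sketched as a simple loop. This is a hedged example, not the run reported in this chapter: fake_data is a hypothetical random stand-in for the normalized matrix, and the range of k and the nstart and iter.max settings are assumptions. It uses the kmeans function from R's standard stats package.

```r
# Sketch: fit k-means for several values of k and compare the results.
set.seed(42)

# Hypothetical stand-in for the normalized data: 200 rows, 20 numeric attributes
fake_data = matrix(rnorm(200 * 20), ncol = 20)

ks = 2:6  # candidate numbers of clusters (an arbitrary choice for illustration)

# nstart = 10 reruns each fit from 10 random starts and keeps the best one
fits = lapply(ks, function(k) {
  kmeans(fake_data, centers = k, nstart = 10, iter.max = 50)
})

# Total within-cluster sum of squares: lower means points sit closer to centroids
wss = sapply(fits, function(f) f$tot.withinss)

for (i in seq_along(ks)) {
  cat("k =", ks[i],
      " within-cluster SS =", round(wss[i], 1),
      " cluster sizes:", fits[[i]]$size, "\n")
}
```

The within-cluster sum of squares generally shrinks as k grows, so it is only a rough guide. Looking at the centroids and cluster sizes for each k, as the text suggests, is still the main way to judge whether the clusters are meaningful.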
