Cluster Analysis

Cluster analysis is a set of techniques that look for groups (clusters) in the data. Objects belonging to the same group resemble each other. Objects belonging to different groups are dissimilar. Sounds simple, doesn't it? The problem is that there is usually a huge amount of redundancy in the explanatory variables. It is not obvious which measurements (or combinations of measurements) will turn out to be the ones that are best for allocating individuals to groups. There are three ways of carrying out such allocation:

  • partitioning into a number of clusters specified by the user, with functions such as kmeans
  • hierarchical (agglomerative), starting with each individual as a separate entity and merging groups until a single aggregation remains, using functions such as hclust
  • divisive, starting with a single aggregate of all the individuals and splitting up clusters until all the individuals are in different groups.
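
As a rough illustration of the three approaches listed above, the following sketch runs each of them on the built-in USArrests dataframe. The choice of data, the number of groups (4), and the use of diana from the cluster package for the divisive case are assumptions made for this example and do not come from the text.

```r
# Illustrative data only: the built-in USArrests dataframe (not used in the text).
# Standardize the columns so that no single variable dominates the distances.
dat <- scale(USArrests)

# 1. Partitioning: the user specifies the number of clusters (4 is arbitrary here)
set.seed(1)
part <- kmeans(dat, centers = 4, nstart = 25)

# 2. Hierarchical: start with every individual alone and merge upwards
agg <- hclust(dist(dat), method = "complete")
agg_groups <- cutree(agg, k = 4)      # cut the tree into 4 groups

# 3. Divisive: start with one aggregate and split downwards
#    (assumes the recommended 'cluster' package is installed)
library(cluster)
div <- diana(dat)
div_groups <- cutree(as.hclust(div), k = 4)

# Cross-tabulate two of the allocations to see how far they agree
table(kmeans = part$cluster, hclust = agg_groups)
```

The three allocations need not agree with one another, because each method optimizes a different criterion.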

Partitioning

The kmeans function operates on a dataframe in which the columns are variables and the rows are the individuals. Group membership is determined by calculating the centroid of each group (the multidimensional equivalent of the mean) and assigning each individual to the group with the nearest centroid. The kmeans function fits a user-specified number of cluster centres such that the within-cluster sum of squares from these centres is minimized, based on Euclidean distance. Here are data from six groups with two continuous explanatory variables (x and y):

kmd<-read.table("c:\\temp\\kmeansdata.txt",header=T) ...
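
The file kmeansdata.txt is not reproduced here, so the following is a minimal sketch that uses simulated data instead. The six group centres, the group size, the spread, and the names sim, centres and fit are all hypothetical. It fits six cluster centres with kmeans and then checks that each individual sits in the group whose centroid is nearest.

```r
set.seed(42)

# Hypothetical centres for six groups in two variables, x and y
centres <- data.frame(x = c(2, 2, 5, 8, 8, 5),
                      y = c(2, 8, 5, 2, 8, 11))
n <- 30                                   # individuals per group (assumed)

# Simulated stand-in for the data: rows are individuals, columns are variables
sim <- data.frame(
  x = rnorm(6 * n, mean = rep(centres$x, each = n), sd = 0.5),
  y = rnorm(6 * n, mean = rep(centres$y, each = n), sd = 0.5)
)

# Fit a user-specified number of centres; nstart reruns from several random starts
fit <- kmeans(sim, centers = 6, nstart = 20)

fit$centers        # fitted centroids: the mean of x and y within each cluster
fit$tot.withinss   # total within-cluster sum of squares (the quantity minimized)

# Euclidean distance from every individual to every fitted centroid
d <- as.matrix(dist(rbind(fit$centers, as.matrix(sim))))[-(1:6), 1:6]
nearest <- apply(d, 1, which.min)

# TRUE when every individual is allocated to its nearest centroid
all(nearest == fit$cluster)
```

Because kmeans starts from randomly chosen centres, the cluster numbering (and occasionally the allocation itself) can change between runs; fixing the seed and setting nstart above 1 makes the result more stable.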
