4

Advanced Analytical Theory and Methods: Clustering

Key Concepts

Centroid

Clustering

K-means

Unsupervised

Within Sum of Squares

Building upon the introduction to R presented in Chapter 3, “Review of Basic Data Analytic Methods Using R,” Chapter 4, “Advanced Analytical Theory and Methods: Clustering” through Chapter 9, “Advanced Analytical Theory and Methods: Text Analysis” describe several commonly used analytical methods that may be considered for the Model Planning and Execution phases (Phases 3 and 4) of the Data Analytics Lifecycle. This chapter considers clustering techniques and algorithms.

4.1 Overview of Clustering

In general, clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structure within unlabeled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. The structure of the data describes the objects of interest and determines how best to group the objects. For example, based on customers' personal income, it is straightforward to divide the customers into three groups depending on arbitrarily selected values. The customers could be divided into three groups as follows:

  • Earn less than $10,000
  • Earn between $10,000 and $99,999
  • Earn $100,000 or more

In this case, the income levels were chosen somewhat subjectively based on easy-to-communicate points of delineation. However, ...

Get Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.