Clustering
Another important data mining technique is clustering. Clustering is a way to find similar sets of observations in a data set; groups of similar observations are called clusters. There are several functions available for clustering in R.
Distance Measures
To effectively use clustering algorithms, you need to begin by measuring the distance between observations. A convenient way to do this in R is through the function dist in the stats package:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
The dist function computes the distances between pairs of rows in an object such as a matrix or a data frame. It returns a distance matrix (an object of class “dist”) containing the computed distances. Here is a description of the arguments to dist.
Argument | Description | Default |
---|---|---|
x | The object on which to compute distances. Must be a data frame, matrix, or “dist” object. | |
method | The method for computing distances. Specify method="euclidean" for Euclidean distances (2-norm), method="maximum" for the maximum distance between observations (supremum norm), method="manhattan" for the absolute distance between two vectors (1-norm), method="canberra" for Canberra distances (see the help file), method="binary" to regard nonzero values as 1 and zeros as 0, or method="minkowski" to use the p-norm (the pth root of the sum of the pth powers of the differences of the components). | “euclidean” |
diag | A logical value specifying whether the diagonal of the distance matrix should be printed by ... |
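
To make the method argument concrete, here is a minimal sketch that applies dist to a small matrix. The data and the variable names (m, d_euc, d_man, d_max, d_mink, by_hand) are invented for illustration and are not from the text. The last lines recompute one Minkowski distance by hand, following the pth-root formula described in the table.

```r
# Hypothetical toy data: three observations (rows) in two dimensions
m <- matrix(c(0, 0,
              3, 4,
              6, 8),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("x", "y")))

# dist() measures distances between rows; Euclidean is the default
d_euc  <- dist(m)
d_man  <- dist(m, method = "manhattan")          # 1-norm
d_max  <- dist(m, method = "maximum")            # supremum norm
d_mink <- dist(m, method = "minkowski", p = 3)   # p-norm with p = 3

# A "dist" object stores only the lower triangle;
# as.matrix() expands it to a full symmetric matrix
print(as.matrix(d_euc))

# Minkowski distance between rows a and b, computed by hand:
# the pth root of the sum of the pth powers of the absolute differences
p <- 3
by_hand <- sum(abs(m["a", ] - m["b", ])^p)^(1 / p)
print(all.equal(as.matrix(d_mink)["a", "b"], by_hand))
```

Because rows a and b differ by 3 and 4 in the two columns, the Euclidean distance between them is 5, the Manhattan distance is 7, and the maximum distance is 4, which makes the three norms easy to compare side by side.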