October 2012
Another important data mining technique is clustering. Clustering is a way to find similar sets of observations in a data set; groups of similar observations are called clusters. There are several functions available for clustering in R.
To effectively use clustering algorithms, you need to
begin by measuring the distance between observations. A convenient way
to do this in R is through the function dist in the stats
package:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
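As a minimal sketch of how this call might be used, the following example computes pairwise distances for a small made-up matrix. The data, the row names, and the variable names (`x`, `d`, `m`, `d_manhattan`, `d_maximum`) are assumptions chosen for illustration; only `dist` and `as.matrix` come from R itself.

```r
# Toy data: three observations (rows) with two numeric variables (columns).
# The values and names are invented for illustration.
x <- matrix(c(0, 0,
              3, 4,
              6, 8),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("v1", "v2")))

# Euclidean distances between every pair of rows (the default method).
# Here a-b is 5, a-c is 10, and b-c is 5.
d <- dist(x)
print(d)

# Convert to an ordinary symmetric matrix to look up a single pair by name.
m <- as.matrix(d)
m["a", "c"]

# Other metrics are selected with the method argument.
d_manhattan <- dist(x, method = "manhattan")  # sums of absolute differences
d_maximum   <- dist(x, method = "maximum")    # largest absolute difference
```

Converting with `as.matrix` is convenient for inspection, but the compact dist object is what R's clustering functions such as `hclust` expect as input.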
The dist function computes the
distance between each pair of rows in an object such as a matrix or
a data frame. It returns the computed distances as an object of class dist, which stores the lower triangle of the distance matrix. Here
is a description of the arguments to dist.
| Argument | Description | Default |
|---|---|---|
| x | The object on which to compute distances. Must be a data frame, matrix, or dist object. | |
| method | The method for computing distances. Specify method="euclidean" for Euclidean distances (2-norm), method="maximum" for the maximum distance between observations (supremum norm), method="manhattan" for the absolute distance between two vectors (1-norm), method="canberra" for Canberra distances (see the help file), method="binary" to regard nonzero values as 1 and zeros as 0, or method="minkowski" to use the p-norm (the pth root of the sum of the pth powers of the differences of the components). | "euclidean" |
| diag | A logical value specifying whether the diagonal of the distance matrix should be printed by print.dist ... |