December 2018
Intermediate to advanced
318 pages
8h 28m
English
In k-means clustering, the variables are usually not all measured on the same scale, so their variances can differ widely. This makes the clusters less spherical. The uneven variance also effectively puts more weight on the variables with the higher variance, because they contribute most to the distance between points.
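To make the scale problem concrete, here is a minimal sketch in plain Python. The points and the `euclidean` helper are hypothetical and chosen only for illustration; they do not come from the book's dataset.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical points: (age in years, income in dollars).
a = (25, 50_000)
b = (60, 50_500)  # very different age, similar income
c = (26, 52_000)  # almost the same age, moderately different income

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# The income column is measured in much larger units, so it dominates:
# a ends up "closer" to b than to c despite the 35-year age gap.
print(d_ab < d_ac)
```

On unscaled data like this, the age column has almost no influence on which points look similar.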
To fix this bias, we need to normalize our data, especially because we use Euclidean distance, which lets variables with a bigger magnitude dominate the cluster assignments. We fix this by standardizing every variable: we subtract the variable's mean from each value and then divide the result by the variable's standard deviation.
We normalize our data using this same calculation:
def normalize(thedata):
    n = thedata.count()
    avg = thedata.reduce(add) ...
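Since the listing above is cut off, the following is only a rough, self-contained sketch of the standardization described in the text, not the book's actual function. It works on a plain Python list rather than on the object used above, the name `standardize` is hypothetical, and dividing by n (the population standard deviation) is an assumption.

```python
from functools import reduce
from operator import add
import math

def standardize(values):
    # Hypothetical helper: returns (value - mean) / std for each value.
    n = len(values)
    avg = reduce(add, values) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = reduce(add, [(v - avg) ** 2 for v in values]) / n
    std = math.sqrt(variance)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - avg) / std for v in values]

# Made-up income values, for illustration only.
incomes = [50_000, 50_500, 52_000, 47_500]
scaled = standardize(incomes)
```

After this step the column has a mean of 0 and a standard deviation of 1, so it contributes to the Euclidean distance on the same footing as any other standardized variable.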