Chapter 13
Finding Islands of Similarity: Automatic Cluster Detection
This is the first of two chapters about finding islands of similarity in complex data sets. It concentrates on k-means clustering, the most general of the automatic clustering techniques, and on its practical applications. The next chapter covers several other techniques in more detail.
Why is cluster detection useful? The patterns found by data mining are not always immediately forthcoming. Sometimes this is because there are no patterns to be found. Other times, the problem is not a lack of patterns but an excess of them. The data may contain so much complex structure that even the best data mining techniques are unable to coax out meaningful patterns. When mining such data for the answer to a specific question, competing explanations might cancel each other out. As with radio reception, too many competing signals add up to noise. For example, low prices may stimulate purchases in one customer segment but make the product seem less appealing in another, so that across all customers price appears to have little effect. In these situations, cluster detection, an undirected technique, can be of assistance. Cluster detection provides a way to learn about the structure of complex data and to break up the cacophony of competing signals into simpler components.
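The cancellation of competing signals can be made concrete with a small numeric sketch. The two segments, their names, and all of the numbers below are invented purely for illustration, and the segment labels are assumed to be known in advance, which is exactly the information cluster detection would have to recover in practice. One hypothetical segment buys more at low prices, the other buys more at high prices; measured together, price shows no correlation with purchases at all, while within each segment the relationship is perfectly clear.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative price points offered to both segments (made-up values).
prices = [1, 2, 3, 4, 5]

# Hypothetical segment A: purchases fall as price rises.
bargain_purchases = [10 - p for p in prices]

# Hypothetical segment B: purchases rise as price rises.
prestige_purchases = [4 + p for p in prices]

# Pool both segments, as if the segment labels were unknown.
pooled_prices = prices + prices
pooled_purchases = bargain_purchases + prestige_purchases

r_pooled = pearson(pooled_prices, pooled_purchases)
r_bargain = pearson(prices, bargain_purchases)
r_prestige = pearson(prices, prestige_purchases)

print(f"pooled:   {r_pooled:+.2f}")
print(f"bargain:  {r_bargain:+.2f}")
print(f"prestige: {r_prestige:+.2f}")
```

In this contrived data the pooled correlation is zero even though each segment responds strongly to price. Real data is never this tidy, but the same effect, in weaker form, is what makes it worthwhile to split the records into similar groups before looking for patterns.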
When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply. What do trees look like? It is a hard question to answer, not ...