Document clustering
Document clustering is the process of grouping or partitioning text documents into meaningful groups. Clustering algorithms rest on the premise of minimizing the distance between objects within a cluster (the intra-cluster distance) while keeping the distance between different clusters (the inter-cluster distance) at a maximum.
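To make the two distances concrete, here is a minimal sketch in Python that compares the average distance between documents inside the same cluster with the average distance between documents in different clusters. The four toy documents, the `labels` assignment, the raw term-count representation, and the choice of cosine distance are all assumptions made for illustration; they are not taken from the text.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy corpus: two finance headlines, two sports headlines.
docs = [
    "stock market shares fall",
    "shares rise as stock market rallies",
    "team wins football match",
    "football team loses match",
]
labels = [0, 0, 1, 1]  # assumed cluster assignment, one label per document

# Represent each document as a bag of words (raw term counts).
vectors = [Counter(d.split()) for d in docs]

intra, inter = [], []
for i, j in combinations(range(len(docs)), 2):
    d = cosine_distance(vectors[i], vectors[j])
    # Same label: within-cluster pair; different label: between-cluster pair.
    (intra if labels[i] == labels[j] else inter).append(d)

mean_intra = sum(intra) / len(intra)
mean_inter = sum(inter) / len(inter)
print(f"mean intra-cluster distance: {mean_intra:.3f}")
print(f"mean inter-cluster distance: {mean_inter:.3f}")
```

For a sensible grouping, the first number should come out clearly smaller than the second. In this toy corpus the two topics share no words, so every between-cluster pair sits at the maximum cosine distance of 1.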
For example, if we cluster a collection of news articles, we will find that similar documents lie close to each other and fall into the same cluster.
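The news example can be sketched with a small K-means loop written in plain Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the headlines are invented, documents are represented by raw term counts, distance is squared Euclidean, and the initial centroids are fixed by hand (`points[0]` and `points[3]`) so the run is deterministic. The helper names `vectorize`, `sq_dist`, and `kmeans` are hypothetical.

```python
from collections import Counter

def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([float(counts[t]) for t in vocab])
    return rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centroids, max_iter=20):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical news headlines: two about finance, two about sports.
docs = [
    "stock market shares fall",
    "shares rise as stock market rallies",
    "team wins football match",
    "football team loses match",
]
points = vectorize(docs)
labels = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)  # prints [0, 0, 1, 1]
```

The two finance headlines end up in one cluster and the two sports headlines in the other. In practice, one would usually weight terms (for example with TF-IDF) and choose the initial centroids randomly or with a seeding heuristic rather than by hand.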
Some commonly used text clustering methods are as follows:
- Standard methods:
  - K-means
  - Hierarchical clustering
- Specialized clustering:
  - Suffix tree clustering
  - Frequent-term set-based ...