Text clustering
Clustering is an unsupervised learning technique. Intuitively, clustering groups objects into disjoint sets. Typically, we do not know in advance how many groups exist in the data, or what the members of each group (cluster) have in common.
Text clustering has several applications. For example, an organization may want to group its internal documents into clusters based on some similarity measure. The notion of similarity or distance is central to the clustering process. A common approach is to represent each document as a vector of term frequencies, often weighted by TF-IDF, and to compare documents using cosine similarity. Cosine similarity is the cosine of the angle between the word frequency vectors of two documents, that is, their dot product divided by the product of their lengths; the cosine distance is one minus this value. Spark provides a variety of clustering algorithms that can be effectively used in text analytics. ...
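To make the similarity measure concrete, here is a minimal sketch in plain Python of cosine similarity between the word frequency vectors of two documents. It is an illustration only and does not use Spark's API: the function names term_frequencies and cosine_similarity are hypothetical, the whitespace tokenization is a simplifying assumption, and the raw counts are not TF-IDF weighted.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count words in a document (assumption: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    # Dot product over the words the two documents share.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    # Euclidean length of each vector.
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = term_frequencies("spark clustering of text documents")
doc2 = term_frequencies("text documents clustering with spark")
doc3 = term_frequencies("quarterly sales figures")

similar = cosine_similarity(doc1, doc2)    # most words shared, close to 1
unrelated = cosine_similarity(doc1, doc3)  # no words shared, exactly 0
distance = 1.0 - similar                   # the corresponding cosine distance
```

Because the measure depends only on the angle between the vectors, a document and a longer document with the same word proportions score as identical, which is usually the desired behavior when comparing texts of different lengths.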