Summary
In this chapter, we've learned about clustering and used the popular k-means algorithm to cluster large numbers of text documents.
This provided an opportunity to cover the specific challenges presented by text processing, where data is often messy, ambiguous, and high-dimensional. We saw how removing stop words and stemming can both help to reduce the number of dimensions, and how TF-IDF can help identify the most important ones. We also saw how n-grams and shingling can help to tease out context for each word, at the cost of a vast proliferation of terms.
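As a brief recap of these ideas, the following is a minimal sketch in plain Python rather than the chapter's own code. The stop list, the helper names (`tokenize`, `ngrams`, `tf_idf`), and the specific TF-IDF variant (raw term count multiplied by log(N / document frequency)) are all assumptions made for illustration; other weightings are common, and stemming is left out to keep the example short.

```python
import math
from collections import Counter

# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every contiguous run of n tokens (word-level shingles)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Weight each term by its count in a document times log(N / df).

    docs is a list of token lists. A term that appears in every
    document gets weight zero, since it cannot tell documents apart.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


if __name__ == "__main__":
    docs = [tokenize(t) for t in (
        "The cat sat on the mat.",
        "The dog sat on the log.",
    )]
    print(tf_idf(docs))
    # Adding bigrams keeps some word order but enlarges the vocabulary.
    unigrams = {t for d in docs for t in d}
    bigrams = {g for d in docs for g in ngrams(d, 2)}
    print(len(unigrams), len(unigrams | bigrams))
```

Terms shared by every document, such as "sat" here, receive a weight of zero, while terms unique to one document receive the highest weight. The final line prints the vocabulary size before and after adding bigrams, which illustrates the proliferation of terms that n-grams bring.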
We've explored Parkour in greater detail and seen how it can be used to write sophisticated, scalable Hadoop jobs. In particular, we've seen how to make use of ...