This chapter discusses how to partition a collection of texts into groups, a task called clustering. For example, a researcher might analyze a corpus of emails to find subsets that share common themes. The groups are not known beforehand and are determined as part of the analysis. A related task called classification also partitions texts into groups, but in that case the groups are known prior to the analysis. For example, there are commercial programs that classify incoming emails as either spam or nonspam.
These two tasks need different types of information. First, if the groups are unknown prior to the analysis, then a quantitative similarity measure is required that can be applied to any two documents. This approach is called unsupervised because computing similarities can be done by the program without human intervention.
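To make the idea of a similarity measure concrete, here is a minimal sketch in Python. It assumes cosine similarity over raw word counts, which is one common choice and not necessarily the measure used later in this chapter. The function names `word_counts` and `cosine_similarity` and the crude tokenizer are illustrative assumptions.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count its words (a crude tokenizer, assumed for illustration)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Return the cosine of the angle between the word-count vectors of two texts."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words that occur in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text with no words is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because the function accepts any two texts and needs no labels, a program can apply it to every pair of documents in a corpus without human input, which is what makes the approach unsupervised.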
Second, if the groups are known beforehand, then the algorithm requires training data that includes the correct group assignments. For example, developing a spam filter requires training the algorithm with emails that are correctly labeled as spam or nonspam. A human provides these labels, so this approach is called supervised. However, creating or purchasing training data requires resources.
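The role of training data can be illustrated with a small sketch in Python. It assumes a nearest-example rule, where a new text receives the label of the most similar labeled example, and uses the overlap of word sets (Jaccard similarity) as the measure. This is a deliberately simple stand-in and is not how commercial spam filters work. The four labeled emails and the names `words`, `jaccard`, and `classify` are hypothetical.

```python
import re

def words(text):
    """Return the set of lowercase words in a text (crude tokenizer, assumed)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Return the size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify(text, training_data):
    """Return the label of the training example most similar to the text."""
    target = words(text)
    best_label, _ = max(
        ((label, jaccard(target, words(example))) for example, label in training_data),
        key=lambda pair: pair[1],
    )
    return best_label

# Hypothetical training data: each text is paired with a human-supplied label.
training_data = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("meeting agenda for monday", "nonspam"),
    ("draft of the quarterly report attached", "nonspam"),
]

print(classify("free prize inside", training_data))
```

Without the human-supplied labels in `training_data`, the function would have nothing to return, which is why this approach depends on labeled examples.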
Because classification needs training data, which typically does not exist in the public domain, this chapter focuses on clustering. Clustering requires only texts and an algorithm. As seen earlier in this book, there are plenty of the former available on the Web, and the latter exists ...