CHAPTER 5
APPLYING INFORMATION RETRIEVAL TO TEXT MINING
5.1 INTRODUCTION
Information retrieval (IR) is the task of returning texts that are relevant to a query. The most famous application is the online search engine, where the texts are Web pages. The basic underlying concept is simple: a measure of similarity is computed between the query and each document, and the documents are then sorted from most to least relevant.
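The score-then-sort idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how a production search engine works: the similarity measure here is a simple count of shared words, and the names tokenize, overlap_score, and rank are hypothetical helpers introduced for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def overlap_score(query, document):
    """Toy similarity (an assumption for illustration): the number of
    word occurrences that the query and the document have in common."""
    query_counts = Counter(tokenize(query))
    document_counts = Counter(tokenize(document))
    # Counter intersection keeps the minimum count of each shared word.
    return sum((query_counts & document_counts).values())


def rank(query, documents):
    """Score every document against the query, then sort the
    (score, document) pairs from highest to lowest score."""
    scored = [(overlap_score(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    texts = [
        "the raven sat above the door",
        "a black cat",
        "the cat and the raven",
    ]
    for score, text in rank("raven cat", texts):
        print(score, text)
```

Any other similarity measure could be substituted for overlap_score without changing rank, which is the point of the sketch: the ranking step depends only on having one score per document.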
The details of search engines are more complex, of course. For example, Web pages must be found and indexed prior to any queries. For an introduction to this, see chapter 1 of Data Mining the Web by Markov and Larose [77]. For details of how the computations are made, see Google’s PageRank and Beyond by Langville and Meyer [68].
We are interested in using the similarity scores from IR to compare two texts. With these scores, a number of statistical techniques can be employed, such as clustering, the topic of chapter 8.
IR has a number of approaches, and we consider only one: the vector space model. Vector space is a term from linear algebra, but our focus is the specific application of this model to texts, and all the required mathematics is introduced in this chapter. This includes geometric ideas such as angles.
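As a preview of how angles enter the picture, the following Python sketch treats each text as a vector of word counts and computes the cosine of the angle between two such vectors. It shows one common formulation of the vector space model, offered as an assumption rather than as this chapter's exact method. The function names cosine_similarity and angle_degrees are hypothetical, and splitting on whitespace is a deliberately crude way to find words.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts.
    Words are found by splitting on whitespace (a simplifying assumption)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so report no similarity
    return dot / (norm_a * norm_b)


def angle_degrees(text_a, text_b):
    """Angle between the two word-count vectors, in degrees."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, cosine_similarity(text_a, text_b)))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    print(cosine_similarity("he he she", "he she she"))
    print(angle_degrees("he", "she"))
```

Because word counts are never negative, the angle lies between 0 and 90 degrees: texts with the same relative word frequencies give an angle near 0, and texts that share no words give 90.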
5.2 COUNTING LETTERS AND WORDS
To keep the focus on text, not mathematics, we study the distribution of third-person pronouns by gender in four Edgar Allan Poe short stories. Section 4.6.1 shows that the length of a text influences the estimates, so these four stories ...