**CHAPTER 5**

**APPLYING INFORMATION RETRIEVAL TO TEXT MINING**

**5.1 INTRODUCTION**

Information retrieval (IR) is the task of returning relevant texts for a query. The most famous application is the online search engine, where the texts are Web pages. The basic underlying concept is simple: a measure of similarity is computed between the query and each document, and the documents are then sorted from most to least relevant.
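The score-then-sort idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity used here is the Jaccard index on sets of distinct words (a deliberately simple stand-in, not the measure developed in this chapter), and the function names and the three sample documents are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two word sets: shared words / all words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample documents, invented for this sketch.
documents = [
    "The raven sat upon the bust of Pallas",
    "A black cat sat by the cellar wall",
    "The heart beat louder under the floor",
]

ranking = rank("black cat", documents)
for score, doc in ranking:
    print(f"{score:.2f}  {doc}")
```

Only the second document shares any words with the query, so it is ranked first. Any other similarity measure could be substituted for `jaccard` without changing the ranking step.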

The details of search engines are more complex, of course. For example, Web pages must be found and indexed prior to any queries. For an introduction to this, see chapter 1 of *Data Mining the Web* by Markov and Larose [77]. For details of how the computations are made, see *Google’s PageRank and Beyond* by Langville and Meyer [68].

We are interested in using the similarity scores from IR to compare two texts. With these scores, a number of statistical techniques can be employed; one example is clustering, the topic of chapter 8.

IR has a number of approaches, and we consider only one: the *vector space model*. *Vector space* is a term from linear algebra, but our focus is the specific application of this model to texts, and all the required mathematics is introduced in this chapter. This includes geometric ideas such as angles.
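As a preview of the geometric idea, the following Python sketch treats two texts as vectors of counts and computes the angle between them from the standard dot-product formula, cos θ = (u · v) / (|u| |v|). It is a minimal sketch, not the chapter's own implementation: the function names are hypothetical, and the counts are invented for illustration rather than taken from any actual text.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between two vectors in degrees."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    c = max(-1.0, min(1.0, cosine(u, v)))
    return math.degrees(math.acos(c))

# Invented counts of (masculine, feminine) pronouns in two texts.
text_a = [3, 1]
text_b = [1, 3]

print(cosine(text_a, text_b))         # cosine similarity
print(angle_degrees(text_a, text_b))  # angle in degrees
```

A small angle means the two count vectors point in nearly the same direction, that is, the texts use the counted words in similar proportions regardless of their lengths.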

**5.2 COUNTING LETTERS AND WORDS**

To keep the focus on text, not mathematics, we study the distribution of third-person pronouns by gender in four Edgar Allan Poe short stories. Section 4.6.1 shows that the length of a text influences the estimates, so these four stories ...