CHAPTER 5

APPLYING INFORMATION RETRIEVAL TO TEXT MINING

5.1 INTRODUCTION

Information retrieval (IR) is the task of returning texts that are relevant to a query. The most famous application is the online search engine, where the texts are Web pages. The basic underlying concept is simple: a measure of similarity is computed between the query and each document, and the documents are then sorted from most to least relevant.
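The score-then-sort idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity measure used here (the cosine between word-count vectors), the function names, and the three toy documents are all choices made for this sketch, not something specified in the text above.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its words (a deliberately naive tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Score every document against the query, then sort from most to least similar."""
    q = word_counts(query)
    scored = [(cosine_similarity(q, word_counts(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy documents, invented for this sketch.
documents = [
    "the cat sat on the mat",
    "the raven sat above the chamber door",
    "stock prices fell sharply today",
]

ranking = rank_documents("raven chamber door", documents)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

A real search engine precomputes an index rather than rescanning every document for each query, but the ranking step follows the same pattern.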

The details of search engines are more complex, of course. For example, Web pages must be found and indexed prior to any queries. For an introduction to this, see chapter 1 of Data Mining the Web by Markov and Larose [77]. For details of how the computations are made, see Google’s PageRank and Beyond by Langville and Meyer [68].

We are interested in using the similarity scores from IR to compare two texts. Once these scores are available, a number of statistical techniques can be applied to them, for example, clustering, the topic of chapter 8.

IR has a number of approaches, and we consider only one: the vector space model. Vector space is a term from linear algebra, but our focus is the specific application of this model to texts, and all the required mathematics is introduced in this chapter. This includes geometric ideas such as angles.
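As a preview of how angles enter, here is the standard linear algebra formula for the angle between two vectors, written in LaTeX. The symbols x, y, and θ are notation assumed for this sketch, and the two-component vectors are a made-up example rather than counts from any text.

```latex
% Cosine of the angle \theta between vectors x and y with n components
\cos\theta
  = \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}
  = \frac{\sum_{i=1}^{n} x_i y_i}
         {\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}

% Worked example: x = (1, 0), y = (1, 1)
\cos\theta = \frac{1\cdot 1 + 0\cdot 1}{\sqrt{1}\,\sqrt{2}} = \frac{1}{\sqrt{2}},
\qquad \theta = 45^{\circ}
```

When the components are counts, they are never negative, so the cosine lies between 0 and 1: a value of 1 means the two vectors point in the same direction, and 0 means they have no nonzero component in common.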

5.2 COUNTING LETTERS AND WORDS

To keep the focus on text, not mathematics, we study the distribution of third-person pronouns by gender in four Edgar Allan Poe short stories. Section 4.6.1 shows that the length of a text influences the estimates, so these four stories ...
