Chapter 1

Probabilistic Models for Information Retrieval

This chapter presents the main probabilistic models for information retrieval. Recall that an information retrieval system is characterized by three components:

1) a module for indexing queries;

2) a module for indexing documents;

3) a module for matching documents and queries.

We do not cover the indexing modules here; they are treated elsewhere (see, for example, [SAV 10]). We focus only on the matching module. Moreover, among all information retrieval models, we concentrate on the probabilistic ones, as they are considered to be the strongest performers in information retrieval and have been the subject of many developments in recent years.

1.1. Introduction

Information Retrieval (IR) organizes collections of documents and responds to user queries by supplying a list of documents deemed relevant to the user’s needs. In contrast to databases, (a) information retrieval systems process unstructured information, such as the contents of text documents, and (b) they fit well within a probabilistic framework, which is generally based on the following assumption:

Assumption 1. The words and their frequency in a single document or a collection of documents can be considered as random variables. Thus, it is possible to observe the frequency of a word in a corpus and to study it as a random phenomenon. ...
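
As an illustration of Assumption 1, the following minimal sketch treats the frequency of a word in a document as an observed value of a random variable and estimates its empirical distribution over a small collection. The toy corpus, the whitespace tokenization and the function names (`term_frequencies`, `empirical_distribution`) are assumptions made for this example only; they do not come from the chapter.

```python
from collections import Counter

def term_frequencies(document: str) -> Counter:
    """Count each lowercased, whitespace-separated token in one document."""
    return Counter(document.lower().split())

# Hypothetical toy collection, used only for illustration.
corpus = [
    "information retrieval ranks documents",
    "probabilistic models rank documents by relevance",
    "documents contain words and words have frequencies",
]

# One frequency table per document: each entry is one observation
# of the random variable "frequency of word w in a document".
per_document = [term_frequencies(doc) for doc in corpus]

def empirical_distribution(word: str, tables: list) -> dict:
    """Map each observed frequency of `word` to the share of documents showing it."""
    observed = Counter(table[word] for table in tables)  # a missing word counts as 0
    n = len(tables)
    return {freq: count / n for freq, count in sorted(observed.items())}

# Frequency of each word in the whole collection (sum over documents).
collection_frequency = sum(per_document, Counter())

print(empirical_distribution("words", per_document))
print(collection_frequency["documents"])
```

In this toy collection, "words" is absent from two documents and occurs twice in the third, so its observed frequency takes the value 0 in two thirds of the documents and 2 in the remaining third. Probabilistic models replace such raw empirical tables with parametric distributions fitted to them.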
