Although it comes last, this chapter is meant as an introduction to the models presented in the rest of the book. It presents, in full detail, the simplest probabilistic models used in the statistical processing of text collections. A good understanding of these models is required to follow the more advanced developments presented in this book. We have therefore chosen to cover this basic material in this Appendix, so as to make the book self-contained. Alternatively, the reader may refer to some excellent textbooks covering similar topics, such as [CHA 93, MAN 99, JUR 00]. This chapter is also an opportunity to fix some notation and to formulate questions that recur throughout the book: how should a probability distribution be defined over a set of documents, over a set of sentences, and so on?

This chapter is organized in five sections. The first (section A.2) presents several instantiations of the simplest supervised categorization model, the so-called naive Bayes model. For a good understanding of these models, it is useful to read Chapter 1, which covers related models in the context of information retrieval applications. Chapters 3 and 4 are devoted to more sophisticated (and often more accurate) models for categorization tasks. Section A.3 then introduces unsupervised learning problems, through the study of ...
