Chapter 5

Topic-based Generative Models for Text Information Access

5.1. Introduction

This chapter presents generative models of text documents. They can be used either to classify texts (into classes or labels known a priori) or to cluster them (into groups not known a priori). As presented in Appendix A, the only difference between classification (or categorization) and clustering lies in the data available for learning. In classification, (document, class) pairs are available; this is called supervised learning. In clustering, only the documents themselves are available; this is called unsupervised learning. Semi-supervised learning also exists, where only a subset of the learning data is associated with a class [CHA 06b, ZHU 09]. From here on, the generic term “categorization” will be used for all of these situations.
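The three learning settings differ only in how much of the data carries a class, which can be made concrete with a minimal Python sketch. The toy corpus, the labels and the helper name `learning_setting` are assumptions introduced here for illustration; they do not come from the chapter.

```python
from typing import List, Optional, Tuple

# A learning example is a (document, class) pair; the class is None when unknown.
Example = Tuple[str, Optional[str]]

# Hypothetical toy corpus and labels, for illustration only.
documents = [
    "stocks fell sharply on monday",
    "the team won the final match",
    "central bank raises interest rates",
    "striker scores twice in the derby",
]
labels = ["finance", "sport", "finance", "sport"]

# Supervised learning (classification): every document comes with its class.
supervised: List[Example] = list(zip(documents, labels))

# Unsupervised learning (clustering): documents only, no class is available.
unsupervised: List[Example] = [(doc, None) for doc in documents]

# Semi-supervised learning: only a subset of the documents carries a class.
semi_supervised: List[Example] = [
    (doc, lab if i < 2 else None)
    for i, (doc, lab) in enumerate(zip(documents, labels))
]


def learning_setting(data: List[Example]) -> str:
    """Name the learning setting implied by how many examples are labeled."""
    if not data:
        raise ValueError("at least one example is required")
    n_labeled = sum(1 for _, lab in data if lab is not None)
    if n_labeled == len(data):
        return "supervised"
    if n_labeled == 0:
        return "unsupervised"
    return "semi-supervised"


for name, data in [("A", supervised), ("B", unsupervised), ("C", semi_supervised)]:
    print(name, learning_setting(data))
```

The same documents appear in all three data sets; only the availability of the class changes, which is why a single term, “categorization”, can cover all three situations.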

Numerous generative models exist for text categorization [SEB 02, ZHO 05], but here we focus on the most successful models of the last decade: topic models, also known as “latent semantic-based models” or “discrete principal component analysis” [BUN 06, STE 07, BLE 09].

5.1.1. Generative versus discriminative models

Generative and discriminative models (see Chapters 4 and 6) share the same framework, which can be described in general terms by two random variables X and Y: one (X) is observed, and the other (Y) is assumed, that is, hidden or latent. These models differ, however, in their objective: generative ...
