CHAPTER 5
Textual Abstraction: Latent Structure, Dimension Reduction
TEXT MINING DATA SOURCE ASSEMBLY
Topic modeling is one of the livelier areas of text analytics. It includes techniques such as latent semantic analysis (LSA),[i] latent Dirichlet allocation (LDA) as popularized by Blei,[ii] and the SAS approach to text topics described by Cox.[iii] These approaches use a variety of statistical techniques to detect the underlying dimensionality of a collection of textual data and, from it, to infer the common topical content that drives the observed behavior of the text.
LATENT STRUCTURE AND DIMENSIONAL REDUCTION
A classic discussion of using linear products to compress collections of text documents encoded as matrices is given by Albright.[iv] The collection can be expressed as a word × document matrix, which can then be manipulated and summarized with a range of techniques drawn from linear algebra. The approach based on singular value decomposition (SVD) is discussed here.
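As a rough illustration of the idea, the sketch below applies a truncated singular value decomposition to a small matrix using NumPy. The matrix values, the variable names, and the choice of k = 2 are assumptions made for this example only; they are not taken from Albright's discussion.

```python
import numpy as np

# Toy term-by-document count matrix (4 terms x 3 documents).
# The values are illustrative only, not data from the text.
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [0.0, 1.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions to keep (an arbitrary choice here)

# Rank-k approximation built from the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document is now described by k coordinates instead of 4 term counts.
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T

print("singular values:", np.round(s, 3))
print("reduced document coordinates:")
print(np.round(doc_coords, 3))
print("reconstruction error:", np.round(np.linalg.norm(A - A_k), 3))
```

Keeping only the largest singular values is what gives the compression: the discarded dimensions account for the least variation in the matrix, so the reconstruction error equals the size of the singular values that were dropped.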
To set the stage, Albright used a set of documents collected from various diagnostics issued by the SAS processor. Each message is treated as a separate document:
- Error: Invalid message file format
- Error: Unable to open message file using message path
- Error: Unable to format variable
To facilitate computation, the terms and documents are represented as a term-by-document matrix. This is shown in Table 5.1.
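A term-by-document matrix for the three messages can be sketched in a few lines of Python. The tokenization used here (lowercasing, stripping punctuation, keeping every word, raw counts) is an assumption made for the example; the preprocessing behind Table 5.1 may differ, for instance by dropping common words, so the output should not be read as a reproduction of that table.

```python
from collections import Counter

# The three diagnostic messages quoted above, one document each.
documents = [
    "Error: Invalid message file format",
    "Error: Unable to open message file using message path",
    "Error: Unable to format variable",
]

def tokenize(text):
    """Lowercase and strip punctuation (an assumed, minimal scheme)."""
    return [word.strip(":.,").lower() for word in text.split()]

# One term-frequency counter per document.
counts = [Counter(tokenize(doc)) for doc in documents]

# Vocabulary: every distinct term, in alphabetical order.
terms = sorted(set().union(*counts))

# Rows are terms, columns are documents; each cell is a raw term count.
matrix = [[c[term] for c in counts] for term in terms]

header = "".join(f"{'d' + str(j + 1):>4}" for j in range(len(documents)))
print(f"{'term':<10}{header}")
for term, row in zip(terms, matrix):
    print(f"{term:<10}" + "".join(f"{n:>4}" for n in row))
```

Each column is one message and each row is one term, so a column can be read as the vector that represents that document in later matrix operations.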