CHAPTER 5 Textual Abstraction: Latent Structure, Dimension Reduction

TEXT MINING DATA SOURCE ASSEMBLY

Topic modeling is one of the livelier areas of text analytics. Its techniques include latent semantic analysis (LSA),ⁱ latent Dirichlet allocation (LDA) as popularized by Blei,ⁱⁱ and the SAS approach to text topics described by Cox.ⁱⁱⁱ These approaches employ a variety of statistical techniques to detect the underlying dimensionality of a collection of textual data, and from it infer the common topical content that drives the observed behavior of the text.

LATENT STRUCTURE AND DIMENSIONAL REDUCTION

Albrightⁱᵛ gives a classic discussion of using linear products to compress collections of text documents encoded as matrices. The collection can be expressed as a word × document matrix, which can then be manipulated and summarized using a range of approaches drawn from linear algebra. The approach that employs singular value decomposition (SVD) is discussed here.
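
As a rough illustration of the idea, the sketch below applies a truncated singular value decomposition to a small word × document matrix using NumPy. The matrix `A`, its counts, and the choice of `k = 2` retained dimensions are hypothetical values chosen for the example; they are not taken from Albright's discussion.

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix
# (rows = terms, columns = documents). Illustrative values only.
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k = 2 is an arbitrary choice here).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each column holds one document's coordinates in the k-dimensional latent space.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", s)
print("rank-k reconstruction error:", np.linalg.norm(A - A_k))
```

The rank-`k` matrix `A_k` is the closest rank-`k` approximation to `A` in the least-squares (Frobenius) sense, so the columns of `doc_coords` can stand in for the original documents with fewer dimensions.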

To set the stage, Albright used a set of documents collected from various diagnostics issued by the SAS processor. Each message is treated as a separate document:

Error: Invalid message file format
Error: Unable to open message file using message path
Error: Unable to format variable

To facilitate computation, the terms and documents are represented as a term-by-document matrix, shown in Table 5.1.

Table 5.1 Example Term by Document ...
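
A term-by-document matrix for the three messages above can be sketched in a few lines of plain Python. The tokenization used here (lowercasing, dropping the colon, splitting on whitespace) is an assumption made for the example; it applies no stop-word removal or stemming, so the result will not necessarily match the terms or counts in Table 5.1.

```python
# The three diagnostic messages, each treated as one document.
docs = [
    "Error: Invalid message file format",
    "Error: Unable to open message file using message path",
    "Error: Unable to format variable",
]

def tokenize(text):
    # Assumed preprocessing: lowercase, drop the colon, split on whitespace.
    return text.lower().replace(":", "").split()

tokens = [tokenize(d) for d in docs]

# Vocabulary: every distinct term, in alphabetical order.
terms = sorted({t for doc in tokens for t in doc})

# matrix[i][j] = number of times terms[i] occurs in document j.
matrix = [[doc.count(term) for doc in tokens] for term in terms]

for term, row in zip(terms, matrix):
    print(f"{term:<10}", row)
```

Each row corresponds to a term and each column to a document, so a term such as "message", which appears twice in the second document, has a count of 2 in that column.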

Text as Data by Barry DeVille, Gurpreet Singh Bawa
