CHAPTER 7

MULTI VARIATE TECHNIQUES WITH TEXT

7.1 INTRODUCTION

Data is collected by taking measurements. These are often unpredictable, for one of two reasons. First, there is measurement error, and, second, the objects measured can be ran– domly selected as discussed in section 6.2. Random variables can model either type of unpredictability.

Text mining often analyzes multiple texts. For example, there are 68 Edgar Allan Poe short stories, all of which are of interest to a literary critic. A more extreme example is a researcher analyzing the EnronSent corpus. This has about a 100 megabytes of text, which translates into a vast number of emails.

Hence, a text miner often has many variables to analyze simultaneously. There are a number of techniques for this situation, which are collectively called multivariate statistics. This chapter introduces one of these, principal components analysis (PCA).

This chapter focuses on applications, and some of the key ideas of PCA are introduced. The goal is not to explain all the details, but to give some idea about how it works, and how to apply it to texts. Specifically, 68 Poe short stories are analyzed. These are from a five volume collected work that is in the public domain and is available online ([96], [97], [98], [99] and [100]).

These 68 stories are “The Unparalleled Adventures of One Hans Pfaall," “The Gold Bug," “Four Beasts in One," “The Murders in the Rue Morgue," “The Mystery of Marie Rogêt," “The Balloon Hoax," “MS. Found in a ...

Get Practical Text Mining with Perl now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.