O'Reilly logo

Practical Text Mining with Perl by Roger Bilisoly

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

CHAPTER 7

MULTI VARIATE TECHNIQUES WITH TEXT

7.1 INTRODUCTION

Data is collected by taking measurements. These are often unpredictable, for one of two reasons. First, there is measurement error, and, second, the objects measured can be ran– domly selected as discussed in section 6.2. Random variables can model either type of unpredictability.

Text mining often analyzes multiple texts. For example, there are 68 Edgar Allan Poe short stories, all of which are of interest to a literary critic. A more extreme example is a researcher analyzing the EnronSent corpus. This has about a 100 megabytes of text, which translates into a vast number of emails.

Hence, a text miner often has many variables to analyze simultaneously. There are a number of techniques for this situation, which are collectively called multivariate statistics. This chapter introduces one of these, principal components analysis (PCA).

This chapter focuses on applications, and some of the key ideas of PCA are introduced. The goal is not to explain all the details, but to give some idea about how it works, and how to apply it to texts. Specifically, 68 Poe short stories are analyzed. These are from a five volume collected work that is in the public domain and is available online ([96], [97], [98], [99] and [100]).

These 68 stories are “The Unparalleled Adventures of One Hans Pfaall," “The Gold Bug," “Four Beasts in One," “The Murders in the Rue Morgue," “The Mystery of Marie Rogêt," “The Balloon Hoax," “MS. Found in a ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required