CHAPTER 4

PROBABILITY AND TEXT SAMPLING

4.1 INTRODUCTION

Chapters 2 and 3 introduce and explain many features of Perl. Starting at this point, however, the focus shifts to using it for text analyses. Where new features are introduced, these are noted and explained, but the emphasis is on the texts.

This chapter focuses on some of the statistical properties of text. Unfortunately, some of the assumptions common to popular statistical techniques do not hold for text, so care is needed. This is not surprising, because language is more complex than, for example, flipping a coin.

We start with an introduction to the basics of probability. This discussion focuses on the practical, not the theoretical, and all the examples except the first involve text, in keeping with the spirit of this book.

4.2 PROBABILITY

Probability models variability. If a process is repeated and the results are not all the same, then a probabilistic approach can be useful. All gambling games, for instance, have some element of unpredictability, although the amount varies. The outcome of a single flip of a fair coin is completely unpredictable. When a hundred coins are flipped, however, there is roughly a 95% probability that the percentage of heads is between 40% and 60%.
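The hundred-coin figure can be checked by summing binomial probabilities. The following is a minimal sketch, written in Python rather than Perl, and it assumes a fair coin and independent flips. The function name `prob_heads_between` is hypothetical and chosen only for this illustration.

```python
from math import comb

def prob_heads_between(n, lo, hi):
    """Exact probability that n fair-coin flips give between lo and hi heads (inclusive).

    Assumes a fair coin and independent flips, so each of the 2**n
    head/tail sequences is equally likely.
    """
    favorable = sum(comb(n, k) for k in range(lo, hi + 1))
    return favorable / 2**n

# Endpoints included: 40 through 60 heads out of 100.
inclusive = prob_heads_between(100, 40, 60)
# Endpoints excluded: 41 through 59 heads out of 100.
exclusive = prob_heads_between(100, 41, 59)

print(f"P(40 <= heads <= 60) = {inclusive:.4f}")
print(f"P(41 <= heads <= 59) = {exclusive:.4f}")
```

With the endpoints included the probability is a little above 96%, and with them excluded it is a little above 94%, so the quoted 95% is a rounded figure that depends on how "between" is read.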

Language has both structure and variability. For example, in this chapter, what word appears last, just before the start of the exercises? Although some words are more likely than others, and given all the text up to this point, it is still ...
