CHAPTER 3

QUANTITATIVE TEXT SUMMARIES

3.1 INTRODUCTION

There are a number of text mining techniques, many of which require counts of text patterns as their starting point. The last chapter introduces regular expressions, a methodology to describe patterns, and this chapter shows how to count the matches.

As noted in section 2.6, literary texts consist of tokens, most of which are words. One useful task is counting up the number of times each distinct token appears, that is, finding the frequency of types. For example, sentence 3.1 has five tokens but only four types because the word the appears twice, while the rest appear only once.

(3.1) The cat ate the bird.

Although counting four types at once is not hard, it requires deeper knowledge of Perl to count thousands of patterns simultaneously. We begin this chapter by learning enough Perl to do this.

3.2 SCALARS, INTERPOLATION, AND CONTEXT IN PERL

We have already encountered scalar variables, which start with a dollar sign and store exactly one value. Several examples are given in code sample 3.1, which also contrasts the usage of single and double quotes for strings. First, notice that $a and $b have the same value because it does not matter which type of quote marks are used to specify a specific string. However, $g and $h are different because when $a is in double quotes, it is replaced by its value, but this is not true with single quotes. This is another example of interpolation (see the discussion near code sample 2.2).

Get Practical Text Mining with Perl now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.