Practical Text Mining with Perl by Roger Bilisoly

CHAPTER 3

QUANTITATIVE TEXT SUMMARIES

3.1 INTRODUCTION

Many text mining techniques take counts of text patterns as their starting point. The previous chapter introduced regular expressions, a notation for describing patterns, and this chapter shows how to count their matches.

As noted in section 2.6, literary texts consist of tokens, most of which are words. One useful task is counting how many times each distinct token appears, that is, finding the frequency of each type. For example, sentence 3.1 has five tokens but only four types, because the word "the" appears twice (ignoring capitalization) while the others appear once each.

(3.1) The cat ate the bird.

Although counting four types by hand is not hard, counting thousands of patterns simultaneously requires a deeper knowledge of Perl. We begin this chapter by learning enough Perl to do this.
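As a preview of where the chapter is heading, the count for sentence 3.1 can be sketched in a few lines of Perl. This is only an illustrative sketch, not one of the book's code samples: the subroutine name count_types is hypothetical, and it assumes that a token is a run of letters or apostrophes and that capitalization is ignored.

```perl
use strict;
use warnings;

# Count how often each type occurs in a string.
# Assumptions (not from the text): a token is a run of letters or
# apostrophes, and tokens are compared case-insensitively.
sub count_types {
    my ($text) = @_;
    my %count;
    # Each successful match advances through the string; $1 is the token.
    $count{ lc $1 }++ while $text =~ /([a-z']+)/gi;
    return %count;
}

my %freq = count_types('The cat ate the bird.');
for my $type (sort keys %freq) {
    print "$type: $freq{$type}\n";
}
```

Run on sentence 3.1, this prints four lines, one per type, with a count of 2 for "the" and 1 for each of the others. The same hash-based approach extends to thousands of types without any change to the code.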

3.2 SCALARS, INTERPOLATION, AND CONTEXT IN PERL

We have already encountered scalar variables, which start with a dollar sign and store exactly one value. Several examples are given in code sample 3.1, which also contrasts the use of single and double quotes for strings. First, notice that $a and $b have the same value, because for a string with nothing to interpolate it does not matter which kind of quotation mark is used. However, $g and $h differ: when $a appears inside double quotes it is replaced by its value, but inside single quotes it is not. This is another example of interpolation (see the discussion near code sample 2.2).
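The difference between the two kinds of quotes can be seen in a short sketch. This is not code sample 3.1 itself; the variable names and strings below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variable names, chosen only for this illustration.
my $word_single = 'text';    # single quotes
my $word_double = "text";    # double quotes: same string, nothing to interpolate

my $interpolated = "mining $word_single";    # variable is replaced by its value
my $literal      = 'mining $word_single';    # characters are kept exactly as typed

print "same value\n" if $word_single eq $word_double;
print "$interpolated\n";     # prints: mining text
print $literal, "\n";        # prints: mining $word_single
```

The first two scalars compare equal, while the last two differ because only the double-quoted string interpolates the variable.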
