Chapters 2 through 8 all have a theme. For example, regular expressions and data structures underlie chapters 2 and 3, respectively, and chapter 8 focuses on clustering. This one, however, covers three topics in less detail. The goal is to give the interested reader a few parting ideas as well as a few references for text mining.


Not only is Perl free, there are a vast number of free packages already written for Perl. Because the details of obtaining these depends on the operating system, see The Comprehensive Perl Archive Network (CPAN) Web site [54] for instructions on how to download them.

Perl packages are called Perl modules, which are grouped together by topic. Each name typically has two or three parts, which are separated by double colons. The first part usually denotes a general topic, for example, Lingua, String, and Text. The second part is either a subtopic or a specific module. For instance, Lingua’s subtopics are often specific languages; for example, Lingua: : EN for English and Lingua: : DE for German (DE stands for Deutsch). Our first example is from the former.

9.2.1 Modules for Number Words

Lingua: : EN: : Numbers [21] has a three-part name, where Lingua stands for language, and EN for English. The third part states what the package does in particular, and in this case, it involves numbers.

CPAN gives information about each module and tells us there are two functions ...

Get Practical Text Mining with Perl now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.