Practical Text Mining with Perl

Name: Practical Text Mining with Perl
Author: Roger Bilisoly
ISBN: 9780470176436

by Roger Bilisoly

August 2008

Intermediate to advanced

320 pages

9h 14m

English

Wiley

COVER
SERIES TITLE
TITLE
COPYRIGHT PAGE
DEDICATION
LIST OF FIGURES
LIST OF TABLES
PREFACE
ACKNOWLEDGMENTS
CHAPTER 1: INTRODUCTION
1.1 OVERVIEW OF THIS BOOK
1.2 TEXT MINING AND RELATED FIELDS
1.3 ADVICE FOR READING THIS BOOK

CHAPTER 2: TEXT PATTERNS
2.1 INTRODUCTION
2.2 REGULAR EXPRESSIONS
2.3 FINDING WORDS IN A TEXT
2.4 DECOMPOSING POE’S “THE TELL-TALE HEART” INTO WORDS
2.5 A SIMPLE CONCORDANCE
2.6 FIRST ATTEMPT AT EXTRACTING SENTENCES
2.7 REGEX ODDS AND ENDS
2.8 REFERENCES
PROBLEMS
CHAPTER 3: QUANTITATIVE TEXT SUMMARIES
3.1 INTRODUCTION
3.2 SCALARS, INTERPOLATION, AND CONTEXT IN PERL
3.3 ARRAYS AND CONTEXT IN PERL
3.4 WORD LENGTHS IN POE’S “THE TELL-TALE HEART”
3.5 ARRAYS AND FUNCTIONS
3.6 HASHES
3.7 TWO TEXT APPLICATIONS
3.8 COMPLEX DATA STRUCTURES
3.9 REFERENCES
3.10 FIRST TRANSITION
PROBLEMS
CHAPTER 4: PROBABILITY AND TEXT SAMPLING
4.1 INTRODUCTION
4.2 PROBABILITY
4.3 CONDITIONAL PROBABILITY
4.4 MEAN AND VARIANCE OF RANDOM VARIABLES
4.5 THE BAG-OF-WORDS MODEL FOR POE’S “THE BLACK CAT”
4.6 THE EFFECT OF SAMPLE SIZE
4.7 REFERENCES
PROBLEMS
CHAPTER 5: APPLYING INFORMATION RETRIEVAL TO TEXT MINING
5.1 INTRODUCTION
5.2 COUNTING LETTERS AND WORDS
5.3 TEXT COUNTS AND VECTORS
5.4 THE TERM-DOCUMENT MATRIX APPLIED TO POE
5.5 MATRIX MULTIPLICATION
5.6 FUNCTIONS OF COUNTS
5.7 DOCUMENT SIMILARITY
5.8 REFERENCES
PROBLEMS
CHAPTER 6: CONCORDANCE LINES AND CORPUS LINGUISTICS
6.1 INTRODUCTION
6.2 SAMPLING
6.3 CORPUS AS BASELINE
6.4 CONCORDANCING
6.5 COLLOCATIONS AND CONCORDANCE LINES
6.6 APPLICATIONS WITH REFERENCES
6.7 SECOND TRANSITION
PROBLEMS
CHAPTER 7: MULTIVARIATE TECHNIQUES WITH TEXT
7.1 INTRODUCTION
7.2 BASIC STATISTICS
7.3 BASIC LINEAR ALGEBRA
7.4 PRINCIPAL COMPONENTS ANALYSIS
7.5 TEXT APPLICATIONS
7.6 APPLICATIONS AND REFERENCES
PROBLEMS
CHAPTER 8: TEXT CLUSTERING
8.1 INTRODUCTION
8.2 CLUSTERING
8.3 A NOTE ON CLASSIFICATION
8.4 REFERENCES
8.5 LAST TRANSITION
PROBLEMS
CHAPTER 9: A SAMPLE OF ADDITIONAL TOPICS
9.1 INTRODUCTION
9.2 PERL MODULES
9.3 OTHER LANGUAGES: ANALYZING GOETHE IN GERMAN
9.4 PERMUTATION TESTS
9.5 REFERENCES
APPENDIX A: OVERVIEW OF PERL FOR TEXT MINING
A.1 BASIC DATA STRUCTURES
A.2 OPERATORS
A.3 BRANCHING AND LOOPING
A.4 A FEW PERL FUNCTIONS
A.5 INTRODUCTION TO REGULAR EXPRESSIONS
APPENDIX B: SUMMARY OF R USED IN THIS BOOK
B.1 BASICS OF R
B.2 THIS BOOK’S R CODE
REFERENCES
INDEX

Content preview from Practical Text Mining with Perl

CHAPTER 2

TEXT PATTERNS

2.1 INTRODUCTION

Did you ever remember a certain passage in a book but forgot where it was? With the advent of electronic texts, this unpleasant experience has been replaced by the joy of using a search utility. Computers have limitations, but their ability to do what they are told without tiring is invaluable when it comes to combing through large electronic documents. Many of the more sophisticated techniques later in this book rely on an initial analysis that starts with one or more searches.

Before beginning with text patterns, consider the following question. Since humans are experts at understanding text, and, at present, computers are essentially illiterate, can a procedure as simple as a search really find something unexpected to a human? Yes, it can, and here is an example. Anyone fluent in English knows that “the” precedes its noun, so the following sentence is clearly ungrammatical.

(2.1) Dog the is hungry.

Putting “the” before the noun corrects the problem, so sentence 2.2 is correct.

(2.2) The dog is hungry.

A systematically collected sample of text is called a corpus (its plural form is corpora), and large corpora have been collected to study language. For example, the Cambridge International Corpus has over 800 million words and is used in Cambridge University Press language reference books [26]. Since a book has roughly 500 words on a page, this corresponds to roughly 1.6 million pages of text. In such a corpus, is it possible to find a ...

Publisher Resources

ISBN: 9781118210505
