Practical Text Mining with Perl

Name: Practical Text Mining with Perl
Author: Roger Bilisoly
ISBN: 9780470176436

August 2008

Intermediate to advanced

320 pages

9h 14m

English

Wiley

COVER
SERIES TITLE
TITLE
COPYRIGHT PAGE
DEDICATION
LIST OF FIGURES
LIST OF TABLES
PREFACE
ACKNOWLEDGMENTS
CHAPTER 1: INTRODUCTION
1.1 OVERVIEW OF THIS BOOK
1.2 TEXT MINING AND RELATED FIELDS
1.3 ADVICE FOR READING THIS BOOK

CHAPTER 2: TEXT PATTERNS
2.1 INTRODUCTION
2.2 REGULAR EXPRESSIONS
2.3 FINDING WORDS IN A TEXT
2.4 DECOMPOSING POE’S “THE TELL-TALE HEART” INTO WORDS
2.5 A SIMPLE CONCORDANCE
2.6 FIRST ATTEMPT AT EXTRACTING SENTENCES
2.7 REGEX ODDS AND ENDS
2.8 REFERENCES
PROBLEMS

CHAPTER 3: QUANTITATIVE TEXT SUMMARIES
3.1 INTRODUCTION
3.2 SCALARS, INTERPOLATION, AND CONTEXT IN PERL
3.3 ARRAYS AND CONTEXT IN PERL
3.4 WORD LENGTHS IN POE’S “THE TELL-TALE HEART”
3.5 ARRAYS AND FUNCTIONS
3.6 HASHES
3.7 TWO TEXT APPLICATIONS
3.8 COMPLEX DATA STRUCTURES
3.9 REFERENCES
3.10 FIRST TRANSITION
PROBLEMS

CHAPTER 4: PROBABILITY AND TEXT SAMPLING
4.1 INTRODUCTION
4.2 PROBABILITY
4.3 CONDITIONAL PROBABILITY
4.4 MEAN AND VARIANCE OF RANDOM VARIABLES
4.5 THE BAG-OF-WORDS MODEL FOR POE’S “THE BLACK CAT”
4.6 THE EFFECT OF SAMPLE SIZE
4.7 REFERENCES
PROBLEMS

CHAPTER 5: APPLYING INFORMATION RETRIEVAL TO TEXT MINING
5.1 INTRODUCTION
5.2 COUNTING LETTERS AND WORDS
5.3 TEXT COUNTS AND VECTORS
5.4 THE TERM-DOCUMENT MATRIX APPLIED TO POE
5.5 MATRIX MULTIPLICATION
5.6 FUNCTIONS OF COUNTS
5.7 DOCUMENT SIMILARITY
5.8 REFERENCES
PROBLEMS

CHAPTER 6: CONCORDANCE LINES AND CORPUS LINGUISTICS
6.1 INTRODUCTION
6.2 SAMPLING
6.3 CORPUS AS BASELINE
6.4 CONCORDANCING
6.5 COLLOCATIONS AND CONCORDANCE LINES
6.6 APPLICATIONS WITH REFERENCES
6.7 SECOND TRANSITION
PROBLEMS

CHAPTER 7: MULTIVARIATE TECHNIQUES WITH TEXT
7.1 INTRODUCTION
7.2 BASIC STATISTICS
7.3 BASIC LINEAR ALGEBRA
7.4 PRINCIPAL COMPONENTS ANALYSIS
7.5 TEXT APPLICATIONS
7.6 APPLICATIONS AND REFERENCES
PROBLEMS

CHAPTER 8: TEXT CLUSTERING
8.1 INTRODUCTION
8.2 CLUSTERING
8.3 A NOTE ON CLASSIFICATION
8.4 REFERENCES
8.5 LAST TRANSITION
PROBLEMS

CHAPTER 9: A SAMPLE OF ADDITIONAL TOPICS
9.1 INTRODUCTION
9.2 PERL MODULES
9.3 OTHER LANGUAGES: ANALYZING GOETHE IN GERMAN
9.4 PERMUTATION TESTS
9.5 REFERENCES

APPENDIX A: OVERVIEW OF PERL FOR TEXT MINING
A.1 BASIC DATA STRUCTURES
A.2 OPERATORS
A.3 BRANCHING AND LOOPING
A.4 A FEW PERL FUNCTIONS
A.5 INTRODUCTION TO REGULAR EXPRESSIONS

APPENDIX B: SUMMARY OF R USED IN THIS BOOK
B.1 BASICS OF R
B.2 THIS BOOK’S R CODE
REFERENCES
INDEX

Content preview from Practical Text Mining with Perl

CHAPTER 5

APPLYING INFORMATION RETRIEVAL TO TEXT MINING

5.1 INTRODUCTION

Information retrieval (IR) is the task of returning relevant texts for a query. The most famous application is the online search engine where the texts are Web pages. The basic underlying concept is simple: a measure of similarity is computed between the query and each document, which are then sorted from most to least relevant.
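The score-then-sort idea can be sketched in a few lines. This is only an illustration, written in Python rather than the book's Perl. The similarity measure (a count of distinct query words that also occur in the document) is a stand-in chosen for brevity, not the measure the chapter develops. The names `tokenize`, `overlap_score`, and `rank_documents` are hypothetical, and the three "documents" are invented toy strings.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def overlap_score(query, document):
    """Assumed stand-in similarity: number of distinct query words in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))

def rank_documents(query, documents):
    """Score every document against the query, then sort from most to least similar."""
    scored = [(name, overlap_score(query, text)) for name, text in documents.items()]
    # sorted() is stable, so documents with equal scores keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented toy documents, not quotations from any real text.
documents = {
    "doc_a": "a black cat sat near the cellar wall",
    "doc_b": "the clock in the hall struck twelve",
    "doc_c": "rain fell on the cellar door all night",
}

ranking = rank_documents("black cat cellar", documents)
for name, score in ranking:
    print(name, score)
```

Any other similarity function with the same two-argument shape could replace `overlap_score` without changing the ranking step.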

The details of search engines are more complex, of course. For example, Web pages must be found and indexed prior to any queries. For an introduction to this, see chapter 1 of Data Mining the Web by Markov and Larose [77]. For details of how the computations are made, see Google’s PageRank and Beyond by Langville and Meyer [68].

We are interested in using the similarity scores from IR to compare two texts. With these scores a number of statistical techniques can be employed, for example, clustering, the topic of chapter 8.

IR has a number of approaches, and we consider only one: the vector space model. Vector space is a term from linear algebra, but our focus is the specific application of this model to texts, and all the required mathematics is introduced in this chapter. This includes geometric ideas such as angles.
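As a rough illustration of how angles enter the vector space model, the sketch below treats each text as a vector of counts and computes the cosine of the angle between two such vectors. It is written in Python rather than the book's Perl. The function names are hypothetical and the count vectors are made-up numbers, not figures from the book.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("the angle is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between the vectors in degrees, clamping rounding error before acos."""
    cosine = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cosine))

# Made-up counts of two words in three hypothetical texts.
text_a = [3, 1]
text_b = [6, 2]   # same proportions as text_a, twice the length
text_c = [0, 5]

print(cosine_similarity(text_a, text_b))
print(angle_degrees(text_a, text_c))
```

Because `text_b` is a scaled copy of `text_a`, the angle between them is zero up to rounding. The angle depends on the proportions of the counts, not on how long the texts are.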

5.2 COUNTING LETTERS AND WORDS

To keep the focus on text, not mathematics, we study the distribution of third-person pronouns by gender in four Edgar Allan Poe short stories. Section 4.6.1 shows that the length of a text influences the estimates, so these four stories ...

Publisher Resources

ISBN: 9781118210505
