Practical Text Mining with Perl

Author: Roger Bilisoly
ISBN: 9780470176436

August 2008

Intermediate to advanced

320 pages

9h 14m

English

Wiley

COVER
SERIES TITLE
TITLE
COPYRIGHT PAGE
DEDICATION
LIST OF FIGURES
LIST OF TABLES
PREFACE
ACKNOWLEDGMENTS
CHAPTER 1: INTRODUCTION
1.1 OVERVIEW OF THIS BOOK
1.2 TEXT MINING AND RELATED FIELDS
1.3 ADVICE FOR READING THIS BOOK

CHAPTER 2: TEXT PATTERNS
2.1 INTRODUCTION
2.2 REGULAR EXPRESSIONS
2.3 FINDING WORDS IN A TEXT
2.4 DECOMPOSING POE’S “THE TELL-TALE HEART” INTO WORDS
2.5 A SIMPLE CONCORDANCE
2.6 FIRST ATTEMPT AT EXTRACTING SENTENCES
2.7 REGEX ODDS AND ENDS
2.8 REFERENCES
PROBLEMS
CHAPTER 3: QUANTITATIVE TEXT SUMMARIES
3.1 INTRODUCTION
3.2 SCALARS, INTERPOLATION, AND CONTEXT IN PERL
3.3 ARRAYS AND CONTEXT IN PERL
3.4 WORD LENGTHS IN POE’S “THE TELL-TALE HEART”
3.5 ARRAYS AND FUNCTIONS
3.6 HASHES
3.7 TWO TEXT APPLICATIONS
3.8 COMPLEX DATA STRUCTURES
3.9 REFERENCES
3.10 FIRST TRANSITION
PROBLEMS
CHAPTER 4: PROBABILITY AND TEXT SAMPLING
4.1 INTRODUCTION
4.2 PROBABILITY
4.3 CONDITIONAL PROBABILITY
4.4 MEAN AND VARIANCE OF RANDOM VARIABLES
4.5 THE BAG-OF-WORDS MODEL FOR POE’S “THE BLACK CAT”
4.6 THE EFFECT OF SAMPLE SIZE
4.7 REFERENCES
PROBLEMS
CHAPTER 5: APPLYING INFORMATION RETRIEVAL TO TEXT MINING
5.1 INTRODUCTION
5.2 COUNTING LETTERS AND WORDS
5.3 TEXT COUNTS AND VECTORS
5.4 THE TERM-DOCUMENT MATRIX APPLIED TO POE
5.5 MATRIX MULTIPLICATION
5.6 FUNCTIONS OF COUNTS
5.7 DOCUMENT SIMILARITY
5.8 REFERENCES
PROBLEMS
CHAPTER 6: CONCORDANCE LINES AND CORPUS LINGUISTICS
6.1 INTRODUCTION
6.2 SAMPLING
6.3 CORPUS AS BASELINE
6.4 CONCORDANCING
6.5 COLLOCATIONS AND CONCORDANCE LINES
6.6 APPLICATIONS WITH REFERENCES
6.7 SECOND TRANSITION
PROBLEMS
CHAPTER 7: MULTIVARIATE TECHNIQUES WITH TEXT
7.1 INTRODUCTION
7.2 BASIC STATISTICS
7.3 BASIC LINEAR ALGEBRA
7.4 PRINCIPAL COMPONENTS ANALYSIS
7.5 TEXT APPLICATIONS
7.6 APPLICATIONS AND REFERENCES
PROBLEMS
CHAPTER 8: TEXT CLUSTERING
8.1 INTRODUCTION
8.2 CLUSTERING
8.3 A NOTE ON CLASSIFICATION
8.4 REFERENCES
8.5 LAST TRANSITION
PROBLEMS
CHAPTER 9: A SAMPLE OF ADDITIONAL TOPICS
9.1 INTRODUCTION
9.2 PERL MODULES
9.3 OTHER LANGUAGES: ANALYZING GOETHE IN GERMAN
9.4 PERMUTATION TESTS
9.5 REFERENCES
APPENDIX A: OVERVIEW OF PERL FOR TEXT MINING
A.1 BASIC DATA STRUCTURES
A.2 OPERATORS
A.3 BRANCHING AND LOOPING
A.4 A FEW PERL FUNCTIONS
A.5 INTRODUCTION TO REGULAR EXPRESSIONS
APPENDIX B: SUMMARY OF R USED IN THIS BOOK
B.1 BASICS OF R
B.2 THIS BOOK’S R CODE
REFERENCES
INDEX

Content preview from Practical Text Mining with Perl

CHAPTER 6

CONCORDANCE LINES AND CORPUS LINGUISTICS

6.1 INTRODUCTION

A corpus (plural corpora) is a collection of texts that have been put together to research one or more aspects of language. This term is from the Latin and means body. Not surprisingly, corpus linguistics is the study of language using a corpus.

The idea of collecting language samples is old. For example, Samuel Johnson’s dictionary was the first in English to emphasize how words are used by supplying over 100,000 quotations (see the introduction of the abridged version edited by Lynch [61] for more details). Note that his dictionary is still in print. In fact, a complete digital facsimile of the first edition is available [62].

In the spirit of Samuel Johnson, a number of large corpora have been developed to support language references, for example, the Longman Dictionary of American English [74] or the Cambridge Grammar of English [26]. To analyze such corpora, this chapter creates concordances.

The next section introduces a few ideas of statistical sampling, and then considers how to apply these to text sampling. The rest of this chapter discusses examples of concordancing, which provide ample opportunity to apply the Perl programming techniques covered in the earlier chapters.
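To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in Perl. It is an illustration only, not the program developed in the book: the subroutine name concordance_lines, the character-based context radius, and the sample text are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book): return one line per occurrence
# of $word in $text, with up to $radius characters of context on each side.
sub concordance_lines {
    my ($text, $word, $radius) = @_;
    my @lines;
    # \b limits matches to whole words, \Q...\E quotes any regex
    # metacharacters in $word, and /i ignores case.
    while ($text =~ /\b\Q$word\E\b/gi) {
        my $start = $-[0];    # offset where the match begins
        my $end   = $+[0];    # offset just past the match
        my $from  = $start - $radius;
        $from = 0 if $from < 0;
        my $left  = substr($text, $from, $start - $from);
        my $match = substr($text, $start, $end - $start);
        my $right = substr($text, $end, $radius);
        # Right-justify the left context so the keywords align in a column.
        push @lines, sprintf("%*s %s %s", $radius, $left, $match, $right);
    }
    return @lines;
}

# Assumed sample text, adapted from the definitions given above.
my $sample = "A corpus is a collection of texts. Corpus linguistics is "
           . "the study of language using a corpus.";
print "$_\n" for concordance_lines($sample, "corpus", 15);
```

Each output line centers one occurrence of the keyword, so the words that tend to appear around it can be compared at a glance. Measuring context in characters rather than words keeps the sketch short; a word-based window would be an equally reasonable choice.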

6.2 SAMPLING

Sampling replaces measuring every object in a population with measuring only those in a subset. If the sample is representative of the population, estimates can be computed along with their accuracy. Although taking ...

Publisher Resources

ISBN: 9781118210505
