
Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications

by Gary D. Miner, John Elder, Andrew Fast, Thomas Hill, Robert Nisbet, Dursun Delen

January 2012

Intermediate to advanced

1000 pages

28h 4m

English

Academic Press


Cover image
Title page
Table of Contents
Copyright
Dedication
Endorsements for Practical Text Mining & Statistical Analysis for Non-structured Text Data Applications
Foreword 1
Foreword 2
Foreword 3
Acknowledgments

Preface
About the Authors
Introduction
Building the Workshop Manual; Communication; The Structure of this Book; Part I: Basic Text Mining Principles; Part II: Tutorials; Part III: Advanced Topics; Tutorials; Why Did We Write This Book?; What Are the Benefits of Text Mining?; Blast Off!; References
List of Tutorials by Guest Authors
Part I: Basic Text Mining Principles
Chapter 1. The History of Text Mining
Preamble; The Roots of Text Mining: Information Retrieval, Extraction, and Summarization; Information Extraction and Modern Text Mining; Major Innovations in Text Mining since 2000; The Development of Enabling Technology in Text Mining; Emerging Applications in Text Mining; Sentiment Analysis and Opinion Mining; IBM’s Watson: An “Intelligent” Text Mining Machine?; What’s Next?; Postscript; References
Chapter 2. The Seven Practice Areas of Text Analytics
Preamble; What is Text Mining?; The Seven Practice Areas of Text Analytics; Five Questions for Finding the Right Practice Area; The Seven Practice Areas in Depth; Interactions between the Practice Areas; Scope of This Book; Summary; Postscript; References
Chapter 3. Conceptual Foundations of Text Mining and Preprocessing Steps
Preamble; Introduction; Syntax versus Semantics; The Generalized Vector-Space Model; Preprocessing Text; Creating Vectors from Processed Text; Summary; Postscript; Reference
Chapter 4. Applications and Use Cases for Text Mining
Preamble; Why Is Text Mining Useful?; Extracting “Meaning” from Unstructured Text; Summarizing Text; Common Approaches to Extracting Meaning; Extracting Information through Statistical Natural Language Processing; Statistical Analysis of Dimensions of Meaning; Beyond Statistical Analysis of Word Frequencies: Parsing and Analyzing Syntax; Review; Improving Accuracy in Predictive Modeling; Using Statistical Natural Language Processing to Improve Lift; Using Dictionaries to Improve Prediction; Identifying Similarity and Relevance by Searching; Part of Speech Tagging and Entity Extraction; Summary; Postscript; References
Chapter 5. Text Mining Methodology
Preamble; Text Mining Applications; Cross-Industry Standard Process for Data Mining (CRISP-DM); Example 1: An Exploratory Literature Survey Using Text Mining; Postscript; References
Chapter 6. Three Common Text Mining Software Tools
Preamble; Introduction; IBM SPSS Modeler Premium; SAS Text Miner; About the Scenarios in This SAS Section; Tips for Text Mining; STATISTICA Text Miner; Summary: STATISTICA Text Miner; Postscript
Part II: Introduction to the Tutorial and Case Study Section of This Book
Introduction
Reference
Tutorial AA. Case Study: Using the Social Share of Voice to Predict Events That Are about to Happen
Analysis; Summary
Tutorial BB. Mining Twitter for Airline Consumer Sentiment
Introduction; What Is R?; Loading Data into R; The twitteR Package; Extracting Text from Tweets; The plyr Package; Estimating Sentiment; Loading the Opinion Lexicon; Implementing Our Sentiment Scoring Algorithm; Algorithm Sanity Check; data.frames Hold Tabular Data; Scoring the Tweets; Repeat for Each Airline; Compare the Score Distributions; Ignore the Middle; Compare with ACSI’s Customer Satisfaction Index; Scrape the ACSI Website; Compare Twitter Results with ACSI Scores; Graph the Results; Notes and Acknowledgments; References
Tutorial A. Using STATISTICA Text Miner to Monitor and Predict Success of Marketing Campaigns Based on Social Media Data
Introduction; The Key Issue; Step 1: Collecting Data; Step 2: Monitoring the Situation; Step 3: Creating Predictive Models; Step 4: Performing a “What-If” Analysis of the Marketing Campaigns; Step 5: Performing Sentiment Analysis; Summary
Tutorial B. Text Mining Improves Model Performance in Predicting Airplane Flight Accident Outcome
Introduction; The Data; Text Mining the Data; Text Mining Results; Data Preparation; Using Text Mining Results to Build Predictive Models
Tutorial C. Insurance Industry: Text Analytics Adds “Lift” to Predictive Models with STATISTICA Text and Data Miner
Introduction; Data Description; Part A: Comparing the Lift of Predictive Models with and without Text Mining; Boosted Trees (without Text Material); Boosted Trees Adding the Text Mining Variables; How to Merge Graphs; Part B: Enterprise Deployment; Summary
Tutorial D. Analysis of Survey Data for Establishing the “Best Medical Survey Instrument” Using Text Mining
Introduction; The Analysis; Summary
Tutorial E. Analysis of Survey Data for Establishing “Best Medical Survey Instrument” Using Text Mining: Central Asian (Russian Language) Study Tutorial 2: Potential for Constructing Instruments That Have Increased Validity
Introduction; The Analysis; Summary
Tutorial F. Using eBay Text for Predicting ATLAS Instrumental Learning
Introduction; Examining the Data by Types; Summary; Reference
Tutorial G. Text Mining for Patterns in Children’s Sleep Disorders Using STATISTICA Text Miner
Setting Up the Analysis; Reviewing Results; Summary
Tutorial H. Extracting Knowledge from Published Literature Using RapidMiner
Introduction; Motivation; A Brief Introduction to RapidMiner; Text Analytics in RapidMiner; Starting a New Process; Summary; Reference
Tutorial I. Text Mining Speech Samples: Can the Speech of Individuals Diagnosed with Schizophrenia Differentiate Them from Unaffected Controls?
Introduction; Objectives; Case Study: The Steps Used to Prepare the Data; Results and Analysis; Summary; References
Tutorial J. Text Mining Using STM™, CART®, and TreeNet® from Salford Systems: Analysis of 16,000 iPod Auctions on eBay
Installing the Salford Text Miner; Comments on the Challenge
Tutorial K. Predicting Micro Lending Loan Defaults Using SAS® Text Miner
Introduction; About SAS® Text Miner; Project Overview; Preparing the Data and Setting Up the Diagram; Creating a New Project; Registering the Table; Creating a New Diagram; Text Filter Node; Text Topic Node; Creating the Text Mining Flow; Inserting the Data; Understanding Text Parsing; Synonyms and Multiterm Words; Defining Topics; Other Uses of the Interactive Topic Viewer; Making the Predictive Model; Final Results; Viewing the Reports; Text Only Decision Tree; All Variable Text and Relational; Conclusion
Tutorial L. Opera Lyrics: Text Analytics Compared by the Composer and the Century of Composition—Wagner versus Puccini
Tutorial M. Case Study: Sentiment-Based Text Analytics to Better Predict Customer Satisfaction and Net Promoter® Score Using IBM® SPSS® Modeler
Introduction; Business Objectives; Case Study; Creating New Categories and Adding Missing Descriptors; Results and Analysis; Summary; References
Tutorial N. Case Study: Detecting Deception in Text with Freely Available Text and Data Mining Tools
Introduction; General Architecture for Text Engineering; Linguistic Inquiry and Word Count; Working with General Architecture for Text Engineering and Linguistic Inquiry and Word Count Output; Summary; References
Tutorial O. Predicting Box Office Success of Motion Pictures with Text Mining
Introduction; Analysis; Summary; References
Tutorial P. A Hands-On Tutorial of Text Mining in PASW: Clustering and Sentiment Analysis Using Tweets from Twitter
Introduction; Objective; Case Study; Categorization; Cluster Analysis; Analyzing Text Links; Additional Settings; Summary
Tutorial Q. A Hands-On Tutorial on Text Mining in SAS®: Analysis of Customer Comments for Clustering and Predictive Modeling
Introduction; Objective; Case Study; Summary; References
Tutorial R. Scoring Retention and Success of Incoming College Freshmen Using Text Analytics
Introduction; Part I. Predictive Modeling Using Only the Numeric Variables; Part II. Text Mining and Text Variables’ Word Frequencies and Concepts
Tutorial S. Searching for Relationships in Product Recall Data from the Consumer Product Safety Commission with STATISTICA Text Miner
Specifying the Analysis; Reviewing the Results
Tutorial T. Potential Problems That Can Arise in Text Mining: Example Using NALL Aviation Data
Introduction; Spelling Errors; Example: Finding Spelling Errors in Text Miner; Combine Words; Misspellings as Synonyms; Unexpected Terms; Example: Finding Unexpected Terms; Different File Types; Summary
Tutorial U. Exploring the Unabomber Manifesto Using Text Miner
Introduction; Summarizing the Text; Searching for Trends with Pronouns; References
Tutorial V. Text Mining PubMed: Extracting Publications on Genes and Genetic Markers Associated with Migraine Headaches from PubMed Abstracts
Tutorial W. Case Study: The Problem with the Use of Medical Abbreviations by Physicians and Health Care Providers
The Present Problem in the Use of Medical Abbreviations by Physicians and Health Care Providers; TJC (JCAHO) “Do Not Use” Abbreviations; Additional Abbreviations, Acronyms, and Symbols; Using the “Text Mining Project” Format of STATISTICA Text Miner; Using TextMiner3.dbs; Conclusion; Intervention Training Needed; References
Tutorial X. Classifying Documents with Respect to “Earnings” and Then Making a Predictive Model for the Target Variable Using Decision Trees, MARSplines, Naïve Bayes Classifier, and K-Nearest Neighbors with STATISTICA Text Miner
Introduction: Automatic Text Classification; Data File with File References; Specifying the Analysis; Processing the Data Analysis; Saving the Extracted Word Frequencies to the Input File; Initial Feature Selection; General Classification and Regression Trees; K-Nearest Neighbors Modeling; Conclusion; Reference
Tutorial Y. Case Study: Predicting Exposure of Social Messages: The Bin Laden Live Tweeter
Introduction; Analysis; Summary
Tutorial Z. The InFLUence Model: Web Crawling, Text Mining, and Predictive Analysis with 2010–2011 Influenza Guidelines—CDC, IDSA, WHO, and FMC
Abstract; Web Crawling and Text Mining of CDC Documents on FLU; Feature Selection; MARSplines Interactive Module Modeling; Boosted Trees; Naïve Bayes Modeling; K-Nearest Neighbors
Part III: Advanced Topics
Chapter 7. Text Classification and Categorization
Preamble; Introduction; Defining a Classification Problem; Feature Creation; Text Classification Algorithms; Combining Evidence; Evaluating Text Classifiers; Hierarchical Text Classification; Text Classification Applications; Summary; Postscript; References
Chapter 8. Prediction in Text Mining: The Data Mining Algorithms of Predictive Analytics
Preamble; Introduction; The Power of Simple Descriptive Statistics, Graphics, and Visual Text Mining; Visual Data Mining; Predictive Modeling (Supervised Learning); Statistical Models versus General Predictive Modeling; Clustering (Unsupervised Learning); Singular Value Decomposition, Principal Components Analysis, and Dimension Reduction; Association and Link Analysis; Summary; Postscript; References
Chapter 9. Entity Extraction
Preamble; Introduction; Text Features for Entity Extraction; Strategies for Entity Extraction; Choosing an Entity Extraction Approach; Evaluating Entity Extraction; Summary; Postscript; References
Chapter 10. Feature Selection and Dimensionality Reduction
Preamble; Introduction; Feature Selection; Feature Selection Approaches; Dimensionality Reduction; Linear Dimensionality Reduction Approaches; Postscript; References
Chapter 11. Singular Value Decomposition in Text Mining
Preamble; Introduction; Redundancy in Text; Dimensions of Meaning: Latent Semantic Indexing; The Math of Singular Value Decomposition; Graphical Representations and Simple Examples; Singular Value Decomposition in Equation Form; Singular Value Decomposition and Principal Components Analysis Eigenvalues; Some Practical Considerations; Extracting Dimensions; Subjective Methods: Reviewing Graphs; Analytical Methods: Building Models for Dimensions; Useful Analyses Based on Singular Value Decomposition Scores; Cluster Analysis; Predictive Modeling; When SVD Is Not Useful; Summary; Postscript; References
Chapter 12. Web Analytics and Web Mining
Preamble; Web Analytics; The Value of Web Analytics; The Future of Web Analytics and Web Mining; Postscript; References
Chapter 13. Clustering Words and Documents
Preamble; Introduction; Clustering Algorithms; Clustering Documents; Clustering Words; Cluster Visualization; Summary; Postscript; References
Chapter 14. Leveraging Text Mining in Property and Casualty Insurance
Preamble; Introduction; Property and Casualty Insurance as a Business; Analytics Opportunities in the Insurance Life Cycle; Driving Business Value Using Text Mining; Summary; Postscript; References
Chapter 15. Focused Web Crawling
Preamble; Introduction; The Focused Crawling Process; The Opportunities and Challenges of Mining the Web; Topic Hierarchies for Focused Crawling; Training the Document Classifier; Capturing User Feedback; Summary; Postscript; References
Chapter 16. The Future of Text and Web Analytics
Text Analytics and Text Mining; The Pros and Cons of Commercial Software versus Open Source Software; The Future of Text Mining; The Future of Web Analytics; Multisession Pathing; Integration of Web Analytics with Standard BI Tools; Attribution across Multiple Sessions; The Future: What Does It Hold?; New Areas That May Use Text Analytics in the Future; IBM Watson; Summary; References; IBM-Watson References
Chapter 17. Summary
Why Are You Reading This Chapter?; Our Perspective for Applying Text Mining Technology; Part I: Background and Theory; What Is Text Mining?; What Tools Can I Use?; Part II: The Text Mining Laboratory—28 Tutorials; Part III: Advanced Topics; Outlines of Chapter 7–15
Glossary
Index
How to Use the Data Sets and the Text Mining Software on the DVD or on Links for Practical Text Mining
I. Data Sets for the Tutorials in Practical Text Mining; II. SAS Text Miner Software; III. Salford Systems Software, Including a New Text Miner Module Made for this Book (30-Day Free Trial Available); IV. STATISTICA Text Miner Software (30-day free trial on the DVD that accompanies this book)

Content preview from Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications

Tutorial X

Classifying Documents with Respect to “Earnings” and Then Making a Predictive Model for the Target Variable Using Decision Trees, MARSplines, Naïve Bayes Classifier, and K-Nearest Neighbors with STATISTICA Text Miner

Contents

Introduction: Automatic Text Classification

Data File with File References

Specifying the Analysis

Processing the Data Analysis

Saving the Extracted Word Frequencies to the Input File

Initial Feature Selection

General Classification and Regression Trees

K-Nearest Neighbors Modeling

Conclusion

Reference

Introduction: Automatic Text Classification

This example is based on the “classic” Reuters collection of documents. Specifically, 5,000 documents were selected from the Reuters-21578 database, which is a collection ...



Publisher Resources

ISBN: 9780123869791

