Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications

Book Description

Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications brings together all the information, tools and methods a professional will need to efficiently use text mining applications and statistical analysis.

Winner of a 2012 PROSE Award in Computing and Information Sciences from the Association of American Publishers, this book presents a comprehensive how-to reference that shows the user how to conduct text mining and statistically analyze results. In addition to providing an in-depth examination of core text mining and link detection tools, methods and operations, the book examines advanced preprocessing techniques, knowledge representation considerations, and visualization approaches. Finally, the book explores current real-world, mission-critical applications of text mining and link detection using real-world example tutorials in such varied fields as corporate, finance, business intelligence, genomics research, and counterterrorism activities.

The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase dramatically.

  • Extensive case studies, most in a tutorial format, allow the reader to 'click through' the example using a software program and so learn to conduct text mining analyses as quickly as possible
  • Numerous examples, tutorials, PowerPoint slides and datasets available via the companion website on Elsevierdirect.com
  • Glossary of text mining terms provided in the appendix

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Endorsements for Practical Text Mining & Statistical Analysis for Non-structured Text Data Applications
  7. Foreword 1
  8. Foreword 2
  9. Foreword 3
  10. Acknowledgments
  11. Preface
  12. About the Authors
  13. Introduction
    1. Building the Workshop Manual
    2. Communication
    3. The Structure of this Book
    4. Part I: Basic Text Mining Principles
    5. Part II: Tutorials
    6. Part III: Advanced Topics
    7. Tutorials
    8. Why Did We Write This Book?
    9. What Are the Benefits of Text Mining?
    10. Blast Off!
    11. References
  14. List of Tutorials by Guest Authors
  15. Part I: Basic Text Mining Principles
    1. Chapter 1. The History of Text Mining
      1. Preamble
      2. The Roots of Text Mining: Information Retrieval, Extraction, and Summarization
      3. Information Extraction and Modern Text Mining
      4. Major Innovations in Text Mining since 2000
      5. The Development of Enabling Technology in Text Mining
      6. Emerging Applications in Text Mining
      7. Sentiment Analysis and Opinion Mining
      8. IBM’s Watson: An “Intelligent” Text Mining Machine?
      9. What’s Next?
      10. Postscript
      11. References
    2. Chapter 2. The Seven Practice Areas of Text Analytics
      1. Preamble
      2. What is Text Mining?
      3. The Seven Practice Areas of Text Analytics
      4. Five Questions for Finding the Right Practice Area
      5. The Seven Practice Areas in Depth
      6. Interactions between the Practice Areas
      7. Scope of This Book
      8. Summary
      9. Postscript
      10. References
    3. Chapter 3. Conceptual Foundations of Text Mining and Preprocessing Steps
      1. Preamble
      2. Introduction
      3. Syntax versus Semantics
      4. The Generalized Vector-Space Model
      5. Preprocessing Text
      6. Creating Vectors from Processed Text
      7. Summary
      8. Postscript
      9. Reference
    4. Chapter 4. Applications and Use Cases for Text Mining
      1. Preamble
      2. Why Is Text Mining Useful?
      3. Extracting “Meaning” from Unstructured Text
      4. Summarizing Text
      5. Common Approaches to Extracting Meaning
      6. Extracting Information through Statistical Natural Language Processing
      7. Statistical Analysis of Dimensions of Meaning
      8. Beyond Statistical Analysis of Word Frequencies: Parsing and Analyzing Syntax
      9. Review
      10. Improving Accuracy in Predictive Modeling
      11. Using Statistical Natural Language Processing to Improve Lift
      12. Using Dictionaries to Improve Prediction
      13. Identifying Similarity and Relevance by Searching
      14. Part of Speech Tagging and Entity Extraction
      15. Summary
      16. Postscript
      17. References
    5. Chapter 5. Text Mining Methodology
      1. Preamble
      2. Text Mining Applications
      3. Cross-Industry Standard Process for Data Mining (CRISP-DM)
      4. Example 1: An Exploratory Literature Survey Using Text Mining
      5. Postscript
      6. References
    6. Chapter 6. Three Common Text Mining Software Tools
      1. Preamble
      2. Introduction
      3. IBM SPSS Modeler Premium
      4. SAS Text Miner
      5. About the Scenarios in This SAS Section
      6. Tips for Text Mining
      7. STATISTICA Text Miner
      8. Summary: STATISTICA Text Miner
      9. Postscript
  16. Part II: Introduction to the Tutorial and Case Study Section of This Book
    1. Introduction
      1. Reference
    2. Tutorial AA. Case Study: Using the Social Share of Voice to Predict Events That Are about to Happen
      1. Analysis
      2. Summary
    3. Tutorial BB. Mining Twitter for Airline Consumer Sentiment
      1. Introduction
      2. What Is R?
      3. Loading Data into R
      4. The twitteR Package
      5. Extracting Text from Tweets
      6. The plyr Package
      7. Estimating Sentiment
      8. Loading the Opinion Lexicon
      9. Implementing Our Sentiment Scoring Algorithm
      10. Algorithm Sanity Check
      11. data.frames Hold Tabular Data
      12. Scoring the Tweets
      13. Repeat for Each Airline
      14. Compare the Score Distributions
      15. Ignore the Middle
      16. Compare with ACSI’s Customer Satisfaction Index
      17. Scrape the ACSI Website
      18. Compare Twitter Results with ACSI Scores
      19. Graph the Results
      20. Notes and Acknowledgments
      21. References
    4. Tutorial A. Using STATISTICA Text Miner to Monitor and Predict Success of Marketing Campaigns Based on Social Media Data
      1. Introduction
      2. The Key Issue
      3. Step 1: Collecting Data
      4. Step 2: Monitoring the Situation
      5. Step 3: Creating Predictive Models
      6. Step 4: Performing a “What-If” Analysis of the Marketing Campaigns
      7. Step 5: Performing Sentiment Analysis
      8. Summary
    5. Tutorial B. Text Mining Improves Model Performance in Predicting Airplane Flight Accident Outcome
      1. Introduction
      2. The Data
      3. Text Mining the Data
      4. Text Mining Results
      5. Data Preparation
      6. Using Text Mining Results to Build Predictive Models
    6. Tutorial C. Insurance Industry: Text Analytics Adds “Lift” to Predictive Models with STATISTICA Text and Data Miner
      1. Introduction
      2. Data Description
      3. Part A: Comparing the Lift of Predictive Models with and without Text Mining
      4. Boosted Trees (without Text Material)
      5. Boosted Trees Adding the Text Mining Variables
      6. How to Merge Graphs
      7. Part B: Enterprise Deployment
      8. Summary
    7. Tutorial D. Analysis of Survey Data for Establishing the “Best Medical Survey Instrument” Using Text Mining
      1. Introduction
      2. The Analysis
      3. Summary
    8. Tutorial E. Analysis of Survey Data for Establishing “Best Medical Survey Instrument” Using Text Mining: Central Asian (Russian Language) Study Tutorial 2: Potential for Constructing Instruments That Have Increased Validity
      1. Introduction
      2. The Analysis
      3. Summary
    9. Tutorial F. Using eBay Text for Predicting ATLAS Instrumental Learning
      1. Introduction
      2. Examining the Data by Types
      3. Summary
      4. Reference
    10. Tutorial G. Text Mining for Patterns in Children’s Sleep Disorders Using STATISTICA Text Miner
      1. Setting Up the Analysis
      2. Reviewing Results
      3. Summary
    11. Tutorial H. Extracting Knowledge from Published Literature Using RapidMiner
      1. Introduction
      2. Motivation
      3. A Brief Introduction to RapidMiner
      4. Text Analytics in RapidMiner
      5. Starting a New Process
      6. Summary
      7. Reference
    12. Tutorial I. Text Mining Speech Samples: Can the Speech of Individuals Diagnosed with Schizophrenia Differentiate Them from Unaffected Controls?
      1. Introduction
      2. Objectives
      3. Case Study: The Steps Used to Prepare the Data
      4. Results and Analysis
      5. Summary
      6. References
    13. Tutorial J. Text Mining Using STM™, CART®, and TreeNet® from Salford Systems: Analysis of 16,000 iPod Auctions on eBay
      1. Installing the Salford Text Miner
      2. Comments on the Challenge
    14. Tutorial K. Predicting Micro Lending Loan Defaults Using SAS® Text Miner
      1. Introduction
      2. About SAS® Text Miner
      3. Project Overview
      4. Preparing the Data and Setting Up the Diagram
      5. Creating a New Project
      6. Registering the Table
      7. Creating a New Diagram
      8. Text Filter Node
      9. Text Topic Node
      10. Creating the Text Mining Flow
      11. Inserting the Data
      12. Understanding Text Parsing
      13. Synonyms and Multiterm Words
      14. Defining Topics
      15. Other Uses of the Interactive Topic Viewer
      16. Making the Predictive Model
      17. Final Results
      18. Viewing the Reports
      19. Text Only Decision Tree
      20. All Variable Text and Relational
      21. Conclusion
    15. Tutorial L. Opera Lyrics: Text Analytics Compared by the Composer and the Century of Composition—Wagner versus Puccini
    16. Tutorial M. Case Study: Sentiment-Based Text Analytics to Better Predict Customer Satisfaction and Net Promoter® Score Using IBM®SPSS® Modeler
      1. Introduction
      2. Business Objectives
      3. Case Study
      4. Creating New Categories and Adding Missing Descriptors
      5. Results and Analysis
      6. Summary
      7. References
    17. Tutorial N. Case Study: Detecting Deception in Text with Freely Available Text and Data Mining Tools
      1. Introduction
      2. General Architecture for Text Engineering
      3. Linguistic Inquiry and Word Count
      4. Working with General Architecture for Text Engineering and Linguistic Inquiry and Word Count Output
      5. Summary
      6. References
    18. Tutorial O. Predicting Box Office Success of Motion Pictures with Text Mining
      1. Introduction
      2. Analysis
      3. Summary
      4. References
    19. Tutorial P. A Hands-On Tutorial of Text Mining in PASW: Clustering and Sentiment Analysis Using Tweets from Twitter
      1. Introduction
      2. Objective
      3. Case Study
      4. Categorization
      5. Cluster Analysis
      6. Analyzing Text Links
      7. Additional Settings
      8. Summary
    20. Tutorial Q. A Hands-On Tutorial on Text Mining in SAS®: Analysis of Customer Comments for Clustering and Predictive Modeling
      1. Introduction
      2. Objective
      3. Case Study
      4. Summary
      5. References
    21. Tutorial R. Scoring Retention and Success of Incoming College Freshmen Using Text Analytics
      1. Introduction
      2. Part I. Predictive Modeling Using Only the Numeric Variables
      3. Part II. Text Mining and Text Variables’ Word Frequencies and Concepts
    22. Tutorial S. Searching for Relationships in Product Recall Data from the Consumer Product Safety Commission with STATISTICA Text Miner
      1. Specifying the Analysis
      2. Reviewing the Results
    23. Tutorial T. Potential Problems That Can Arise in Text Mining: Example Using NALL Aviation Data
      1. Introduction
      2. Spelling Errors
      3. Example: Finding Spelling Errors in Text Miner
      4. Combine Words
      5. Misspellings as Synonyms
      6. Unexpected Terms
      7. Example: Finding Unexpected Terms
      8. Different File Types
      9. Summary
    24. Tutorial U. Exploring the Unabomber Manifesto Using Text Miner
      1. Introduction
      2. Summarizing the Text
      3. Searching for Trends with Pronouns
      4. References
    25. Tutorial V. Text Mining PubMed: Extracting Publications on Genes and Genetic Markers Associated with Migraine Headaches from PubMed Abstracts
    26. Tutorial W. Case Study: The Problem with the Use of Medical Abbreviations by Physicians and Health Care Providers
      1. The Present Problem in the Use of Medical Abbreviations by Physicians and Health Care Providers
      2. TJC (JCAHO) “Do Not Use” Abbreviations
      3. Additional Abbreviations, Acronyms, and Symbols
      4. Using the “Text Mining Project” Format of STATISTICA Text Miner
      5. Using TextMiner3.dbs
      6. Conclusion
      7. Intervention Training Needed
      8. References
    27. Tutorial X. Classifying Documents with Respect to “Earnings” and Then Making a Predictive Model for the Target Variable Using Decision Trees, MARSplines, Naïve Bayes Classifier, and K-Nearest Neighbors with STATISTICA Text Miner
      1. Introduction: Automatic Text Classification
      2. Data File with File References
      3. Specifying the Analysis
      4. Processing the Data Analysis
      5. Saving the Extracted Word Frequencies to the Input File
      6. Initial Feature Selection
      7. General Classification and Regression Trees
      8. K-Nearest Neighbors Modeling
      9. Conclusion
      10. Reference
    28. Tutorial Y. Case Study: Predicting Exposure of Social Messages: The Bin Laden Live Tweeter
      1. Introduction
      2. Analysis
      3. Summary
    29. Tutorial Z. The InFLUence Model: Web Crawling, Text Mining, and Predictive Analysis with 2010–2011 Influenza Guidelines—CDC, IDSA, WHO, and FMC
      1. Abstract
      2. Web Crawling and Text Mining of CDC Documents on FLU
      3. Feature Selection
      4. MARSplines Interactive Module Modeling
      5. Boosted Trees
      6. Naïve Bayes Modeling
      7. K-Nearest Neighbors
  17. Part III: Advanced Topics
    1. Chapter 7. Text Classification and Categorization
      1. Preamble
      2. Introduction
      3. Defining a Classification Problem
      4. Feature Creation
      5. Text Classification Algorithms
      6. Combining Evidence
      7. Evaluating Text Classifiers
      8. Hierarchical Text Classification
      9. Text Classification Applications
      10. Summary
      11. Postscript
      12. References
    2. Chapter 8. Prediction in Text Mining: The Data Mining Algorithms of Predictive Analytics
      1. Preamble
      2. Introduction
      3. The Power of Simple Descriptive Statistics, Graphics, and Visual Text Mining
      4. Visual Data Mining
      5. Predictive Modeling (Supervised Learning)
      6. Statistical Models versus General Predictive Modeling
      7. Clustering (Unsupervised Learning)
      8. Singular Value Decomposition, Principal Components Analysis, and Dimension Reduction
      9. Association and Link Analysis
      10. Summary
      11. Postscript
      12. References
    3. Chapter 9. Entity Extraction
      1. Preamble
      2. Introduction
      3. Text Features for Entity Extraction
      4. Strategies for Entity Extraction
      5. Choosing an Entity Extraction Approach
      6. Evaluating Entity Extraction
      7. Summary
      8. Postscript
      9. References
    4. Chapter 10. Feature Selection and Dimensionality Reduction
      1. Preamble
      2. Introduction
      3. Feature Selection
      4. Feature Selection Approaches
      5. Dimensionality Reduction
      6. Linear Dimensionality Reduction Approaches
      7. Postscript
      8. References
    5. Chapter 11. Singular Value Decomposition in Text Mining
      1. Preamble
      2. Introduction
      3. Redundancy in Text
      4. Dimensions of Meaning: Latent Semantic Indexing
      5. The Math of Singular Value Decomposition
      6. Graphical Representations and Simple Examples
      7. Singular Value Decomposition in Equation Form
      8. Singular Value Decomposition and Principal Components Analysis Eigenvalues
      9. Some Practical Considerations
      10. Extracting Dimensions
      11. Subjective Methods: Reviewing Graphs
      12. Analytical Methods: Building Models for Dimensions
      13. Useful Analyses Based on Singular Value Decomposition Scores
      14. Cluster Analysis
      15. Predictive Modeling
      16. When SVD Is Not Useful
      17. Summary
      18. Postscript
      19. References
    6. Chapter 12. Web Analytics and Web Mining
      1. Preamble
      2. Web Analytics
      3. The Value of Web Analytics
      4. The Future of Web Analytics and Web Mining
      5. Postscript
      6. References
    7. Chapter 13. Clustering Words and Documents
      1. Preamble
      2. Introduction
      3. Clustering Algorithms
      4. Clustering Documents
      5. Clustering Words
      6. Cluster Visualization
      7. Summary
      8. Postscript
      9. References
    8. Chapter 14. Leveraging Text Mining in Property and Casualty Insurance
      1. Preamble
      2. Introduction
      3. Property and Casualty Insurance as a Business
      4. Analytics Opportunities in the Insurance Life Cycle
      5. Driving Business Value Using Text Mining
      6. Summary
      7. Postscript
      8. References
    9. Chapter 15. Focused Web Crawling
      1. Preamble
      2. Introduction
      3. The Focused Crawling Process
      4. The Opportunities and Challenges of Mining the Web
      5. Topic Hierarchies for Focused Crawling
      6. Training the Document Classifier
      7. Capturing User Feedback
      8. Summary
      9. Postscript
      10. References
    10. Chapter 16. The Future of Text and Web Analytics
      1. Text Analytics and Text Mining
      2. The Pros and Cons of Commercial Software versus Open Source Software
      3. The Future of Text Mining
      4. The Future of Web Analytics
      5. Multisession Pathing
      6. Integration of Web Analytics with Standard BI Tools
      7. Attribution across Multiple Sessions
      8. The Future: What Does It Hold?
      9. New Areas That May Use Text Analytics in the Future
      10. IBM Watson
      11. Summary
      12. References
      13. IBM-Watson References
    11. Chapter 17. Summary
      1. Why Are You Reading This Chapter?
      2. Our Perspective for Applying Text Mining Technology
      3. Part I: Background and Theory
      4. What Is Text Mining?
      5. What Tools Can I Use?
      6. Part II: The Text Mining Laboratory—28 Tutorials
      7. Part III: Advanced Topics
      8. Outlines of Chapters 7–15
  18. Glossary
  19. Index
  20. How to Use the Data Sets and the Text Mining Software on the DVD or on Links for Practical Text Mining
    1. I Data Sets for the Tutorials in Practical Text Mining
    2. II SAS Text Miner Software
    3. III Salford Systems Software, Including a New Text Miner Module Made for this Book (30-Day Free Trial Available)
    4. IV STATISTICA Text Miner Software (30-day free trial on the DVD that accompanies this book)