Mastering Text Mining with R

Book Description

Master text-taming techniques and build effective text-processing applications with R

About This Book

  • Develop the skills you need to build text-mining apps in R with this easy-to-follow guide
  • Gain in-depth understanding of the text mining process with lucid implementation in the R language
  • Example-rich guide that lets you gain high-quality information from text data

Who This Book Is For

If you are an R programmer, analyst, or data scientist who wants to gain experience in performing text data mining and analytics with R, then this book is for you. Exposure to working with statistical methods and language processing would be helpful.

What You Will Learn

  • Get acquainted with some of the highly efficient R packages such as OpenNLP and RWeka to perform various steps in the text mining process
  • Access and manipulate data from different sources such as JSON and HTTP
  • Process text using regular expressions
  • Get to know the different approaches of tagging texts, such as POS tagging, to get started with text analysis
  • Explore different dimensionality reduction techniques, such as Principal Component Analysis (PCA), and understand their implementation in R
  • Discover the underlying themes or topics that are present in an unstructured collection of documents, using common topic models such as Latent Dirichlet Allocation (LDA)
  • Build a baseline sentence-completion application
  • Perform entity extraction and named entity recognition using R

In Detail

Text mining (also called text data mining or text analytics) is the process of extracting useful, high-quality information from text by identifying patterns and trends. R provides an extensive ecosystem of frameworks and packages for mining text.

Starting with the basic statistical concepts used in text mining, this book teaches you how to access, cleanse, and process text using the R language. It then introduces the tools and background you need for the different tagging, chunking, and entailment approaches used in natural language processing. Moving on, it covers dimensionality reduction techniques and their implementation in R. Finally, it turns to pattern recognition in text data using classification mechanisms, entity recognition, and the development of an ontology learning framework.

By the end of the book, you will have developed a practical application from the concepts learned, and will understand how text mining can be used to analyze the vast amount of data available on social media.

Style and approach

This book takes a hands-on, example-driven approach to the text mining process with lucid implementation in R.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. Mastering Text Mining with R
    1. Table of Contents
    2. Mastering Text Mining with R
    3. Credits
    4. About the Authors
    5. About the Reviewers
    6. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Customer Feedback
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    9. 1. Statistical Linguistics with R
      1. Probability theory and basic statistics
        1. Probability space and event
        2. Theorem of compound probabilities
        3. Conditional probability
        4. Bayes' formula for conditional probability
        5. Independent events
        6. Random variables
        7. Discrete random variables
          1. Continuous random variables
        8. Probability frequency function
        9. Probability distributions using R
        10. Cumulative distribution function
        11. Joint distribution
        12. Binomial distribution
        13. Poisson distribution
        14. Counting occurrences
        15. Zipf's law
        16. Heaps' law
        17. Lexical richness
          1. Lexical variation
          2. Lexical density
          3. Lexical originality
          4. Lexical sophistication
      2. Language models
        1. N-gram models
        2. Markov assumption
        3. Hidden Markov models
      3. Quantitative methods in linguistics
        1. Document term matrix
          1. Inverse document frequency
          2. Words similarity and edit-distance functions
          3. Euclidean distance
          4. Cosine similarity
          5. Levenshtein distance
          6. Damerau-Levenshtein distance
          7. Hamming distance
            1. Jaro-Winkler distance
            2. Measuring readability of a text
          8. Gunning fog index
      4. R packages for text mining
        1. OpenNLP
        2. RWeka
        3. RcmdrPlugin.temis
        4. tm
        5. languageR
        6. koRpus
        7. RKEA
        8. maxent
        9. lsa
      5. Summary
    10. 2. Processing Text
      1. Accessing text from diverse sources
        1. File system
          1. PDF documents
          2. Microsoft Word documents
          3. HTML
          4. XML
          5. JSON
          6. HTTP
        2. Databases
      2. Processing text using regular expressions
        1. Tokenization and segmentation
          1. Word tokenization
          2. Operations on a document-term matrix
          3. Sentence segmentation
      3. Normalizing texts
        1. Lemmatization and stemming
          1. Stemming
          2. Lemmatization
          3. Synonyms
      4. Lexical diversity
        1. Analyse lexical diversity
        2. Calculate lexical diversity
        3. Readability
          1. Automated readability index
      5. Language detection
      6. Summary
    11. 3. Categorizing and Tagging Text
      1. Parts of speech tagging
        1. POS tagging with R packages
      2. Hidden Markov Models for POS tagging
        1. Basic definitions and notations
        2. Implementing HMMs
            1. Viterbi underflow
            2. Forward algorithm underflow
        3. OpenNLP chunking
        4. Chunk tags
      3. Collocation and contingency tables
        1. Extracting co-occurrences
          1. Surface Co-occurrence
          2. Textual co-occurrence
          3. Syntactic co-occurrence
        2. Co-occurrence in a document
        3. Quantifying the relation between words
          1. Contingency tables
          2. Detailed analysis on textual collocations
      4. Feature extraction
        1. Synonymy and similarity
        2. Multiwords, negation, and antonymy
        3. Concept similarity
          1. Path length
          2. Resnik similarity
          3. Lin similarity
          4. Jiang-Conrath distance
      5. Summary
    12. 4. Dimensionality Reduction
      1. The curse of dimensionality
        1. Distance concentration and computational infeasibility
      2. Dimensionality reduction
        1. Principal component analysis
        2. Using R for PCA
          1. Understanding the FactoMineR package
          2. Amap package
          3. Proportion of variance
          4. Scree plot
        3. Reconstruction error
      3. Correspondence analysis
          1. Canonical correspondence analysis
            1. Pearson's Chi-squared test
          2. Multiple correspondence analysis
        1. Implementation of SVD using R
      4. Summary
    13. 5. Text Summarization and Clustering
      1. Topic modeling
        1. Latent Dirichlet Allocation
        2. Correlated topic model
          1. Model selection
          2. R Package for topic modeling
            1. Fitting the LDA model with the VEM algorithm
      2. Latent semantic analysis
        1. R Package for latent semantic analysis
          1. Illustrative example of LSA
      3. Text clustering
      4. Document clustering
        1. Feature selection for text clustering
          1. Mutual information
          2. Chi-square statistic feature selection
          3. Frequency-based feature selection
      5. Sentence completion
      6. Summary
    14. 6. Text Classification
      1. Text classification
      2. Document representation
        1. Feature hashing
        2. Classifiers – inductive learning
        3. Tree-based learning
        4. Bayesian classifiers: Naive Bayes classification
          1. K-Nearest neighbors
      3. Kernel methods
        1. Support vector machines
          1. Kernel Trick
        2. How to apply SVM on a real world example?
        3. Maximum entropy classifier
          1. Maxent implementation in R
        4. RTextTools: a text classification framework
        5. Model evaluation
          1. Confusion matrix
          2. ROC curve
          3. Precision-recall
      4. Bias–variance trade-off and learning curve
        1. Bias-variance decomposition
      5. Learning curve
      6. Dealing with reducible error components
        1. Cross validation
          1. Leave-one-out
          2. k-Fold
          3. Bootstrap
          4. Stratified
      7. Summary
    15. 7. Entity Recognition
      1. Entity extraction
        1. The rule-based approach
        2. Machine learning
      2. Sentence boundary detection
        1. Word token annotator
      3. Named entity recognition
          1. Training a model with new features
      4. Summary
    16. Index