The Natural Language Processing Workshop

Book description

Make NLP easy by building chatbots and models, and executing various NLP tasks to gain data-driven insights from raw text data

Key Features

  • Get familiar with key natural language processing (NLP) concepts and terminology
  • Explore the functionalities and features of popular NLP tools
  • Learn how to use Python programming and third-party libraries to perform NLP tasks

Book Description

Do you want to learn how to communicate with computer systems using Natural Language Processing (NLP) techniques, or make a machine understand human sentiment? Do you want to build applications such as Siri, Alexa, or chatbots, even if you've never done it before?

With The Natural Language Processing Workshop, you can expect to make consistent progress as a beginner, and get up to speed in an interactive way, with the help of hands-on activities and fun exercises.

The book starts with an introduction to NLP. You'll study different approaches to NLP tasks and perform exercises in Python to understand the process of preparing datasets for NLP models. Next, you'll collect datasets from open websites using web scraping and APIs, and apply advanced NLP algorithms and visualization techniques to summarize documents and generate text from them. In the final chapters, you'll use NLP to create a chatbot that detects positive or negative sentiment in text documents such as movie reviews.

By the end of this book, you'll be equipped with the essential NLP tools and techniques you need to solve common business problems that involve processing text.
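
The sentiment analysis covered in the closing chapter builds on the textblob library (see Chapter 8 in the table of contents). As a minimal, illustrative sketch (assuming the textblob package is installed; this is not an excerpt from the book), scoring a single movie review looks like this:

    from textblob import TextBlob

    review = "The movie was surprisingly good, with a strong cast."
    sentiment = TextBlob(review).sentiment

    # polarity runs from -1 (negative) to +1 (positive);
    # subjectivity runs from 0 (objective) to 1 (subjective)
    print(sentiment.polarity, sentiment.subjectivity)

A polarity above zero is usually read as positive sentiment; the book goes further and trains its own sentiment models using TFIDF features and logistic regression.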

What you will learn

  • Obtain, verify, clean, and transform text data into a format suitable for use
  • Use methods such as tokenization and stemming for text cleaning and feature extraction (a short code sketch follows this list)
  • Develop a classifier to categorize comments in Wikipedia articles
  • Collect data from open websites with the help of web scraping
  • Train a model to detect topics in a set of documents using topic modeling
  • Discover techniques to represent text as word and document vectors
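
As a taste of the hands-on style, the sketch below pairs tokenization with Porter stemming, the techniques named in the second point above (the Porter stemmer is covered in Chapter 2). It assumes the NLTK library is installed and is an illustrative sketch rather than code from the book; the punkt tokenizer data is downloaded in the first step.

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    nltk.download("punkt")  # one-time download of the tokenizer models

    sentence = "The strikers were striking while the runners kept running."
    tokens = word_tokenize(sentence)                    # split the sentence into word tokens
    stemmer = PorterStemmer()
    stems = [stemmer.stem(token) for token in tokens]   # reduce each token to its stem
    print(stems)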

Who this book is for

This book is for beginner to mid-level data scientists, machine learning developers, and NLP enthusiasts. A basic understanding of machine learning and NLP will help you grasp the topics covered in this workshop more quickly.

Table of contents

  1. The Natural Language Processing Workshop
  2. Preface
    1. About the Book
      1. Audience
      2. About the Chapters
      3. Conventions
      4. Code Presentation
      5. Setting up Your Environment
      6. Installation and Setup
      7. Installing the Required Libraries
      8. Installing Libraries
      9. Accessing the Code Files
  3. 1. Introduction to Natural Language Processing
    1. Introduction
    2. History of NLP
    3. Text Analytics and NLP
      1. Exercise 1.01: Basic Text Analytics
    4. Various Steps in NLP
      1. Tokenization
      2. Exercise 1.02: Tokenization of a Simple Sentence
      3. PoS Tagging
      4. Exercise 1.03: PoS Tagging
      5. Stop Word Removal
      6. Exercise 1.04: Stop Word Removal
      7. Text Normalization
      8. Exercise 1.05: Text Normalization
      9. Spelling Correction
      10. Exercise 1.06: Spelling Correction of a Word and a Sentence
      11. Stemming
      12. Exercise 1.07: Using Stemming
      13. Lemmatization
      14. Exercise 1.08: Extracting the Base Word Using Lemmatization
      15. Named Entity Recognition (NER)
      16. Exercise 1.09: Treating Named Entities
    5. Word Sense Disambiguation
      1. Exercise 1.10: Word Sense Disambiguation
    6. Sentence Boundary Detection
      1. Exercise 1.11: Sentence Boundary Detection
      2. Activity 1.01: Preprocessing of Raw Text
    7. Kick Starting an NLP Project
      1. Data Collection
      2. Data Preprocessing
      3. Feature Extraction
      4. Model Development
      5. Model Assessment
      6. Model Deployment
    8. Summary
  4. 2. Feature Extraction Methods
    1. Introduction
    2. Types of Data
      1. Categorizing Data Based on Structure
      2. Categorizing Data Based on Content
    3. Cleaning Text Data
      1. Tokenization
      2. Exercise 2.01: Text Cleaning and Tokenization
      3. Exercise 2.02: Extracting n-grams
      4. Exercise 2.03: Tokenizing Text with Keras and TextBlob
      5. Types of Tokenizers
      6. Exercise 2.04: Tokenizing Text Using Various Tokenizers
      7. Stemming
      8. RegexpStemmer
      9. Exercise 2.05: Converting Words in the Present Continuous Tense into Base Words with RegexpStemmer
      10. The Porter Stemmer
      11. Exercise 2.06: Using the Porter Stemmer
      12. Lemmatization
      13. Exercise 2.07: Performing Lemmatization
      14. Exercise 2.08: Singularizing and Pluralizing Words
      15. Language Translation
      16. Exercise 2.09: Language Translation
      17. Stop-Word Removal
      18. Exercise 2.10: Removing Stop Words from Text
      19. Activity 2.01: Extracting Top Keywords from the News Article
    4. Feature Extraction from Texts
      1. Extracting General Features from Raw Text
      2. Exercise 2.11: Extracting General Features from Raw Text
      3. Exercise 2.12: Extracting General Features from Text
      4. Bag of Words (BoW)
      5. Exercise 2.13: Creating a Bag of Words
      6. Zipf's Law
      7. Exercise 2.14: Zipf's Law
      8. Term Frequency–Inverse Document Frequency (TFIDF)
      9. Exercise 2.15: TFIDF Representation
    5. Finding Text Similarity – Application of Feature Extraction
      1. Exercise 2.16: Calculating Text Similarity Using Jaccard and Cosine Similarity
      2. Word Sense Disambiguation Using the Lesk Algorithm
      3. Exercise 2.17: Implementing the Lesk Algorithm Using String Similarity and Text Vectorization
      4. Word Clouds
      5. Exercise 2.18: Generating Word Clouds
      6. Other Visualizations
      7. Exercise 2.19: Other Visualizations – Dependency Parse Trees and Named Entities
      8. Activity 2.02: Text Visualization
    6. Summary
  5. 3. Developing a Text Classifier
    1. Introduction
    2. Machine Learning
      1. Unsupervised Learning
      2. Hierarchical Clustering
      3. Exercise 3.01: Performing Hierarchical Clustering
      4. k-means Clustering
      5. Exercise 3.02: Implementing k-means Clustering
    3. Supervised Learning
      1. Classification
      2. Logistic Regression
      3. Exercise 3.03: Text Classification – Logistic Regression
      4. Naive Bayes Classifiers
      5. Exercise 3.04: Text Classification – Naive Bayes
      6. k-nearest Neighbors
      7. Exercise 3.05: Text Classification Using the k-nearest Neighbors Method
      8. Regression
      9. Linear Regression
      10. Exercise 3.06: Regression Analysis Using Textual Data
      11. Tree Methods
      12. Exercise 3.07: Tree-Based Methods – Decision Tree
      13. Random Forest
      14. Gradient Boosting Machine and Extreme Gradient Boost
      15. Exercise 3.08: Tree-Based Methods – Random Forest
      16. Exercise 3.09: Tree-Based Methods – XGBoost
      17. Sampling
      18. Exercise 3.10: Sampling (Simple Random, Stratified, and Multi-Stage)
    4. Developing a Text Classifier
      1. Feature Extraction
      2. Feature Engineering
      3. Removing Correlated Features
      4. Exercise 3.11: Removing Highly Correlated Features (Tokens)
      5. Dimensionality Reduction
      6. Exercise 3.12: Performing Dimensionality Reduction Using Principal Component Analysis
      7. Deciding on a Model Type
      8. Evaluating the Performance of a Model
      9. Exercise 3.13: Calculating the RMSE and MAPE of a Dataset
      10. Activity 3.01: Developing End-to-End Text Classifiers
    5. Building Pipelines for NLP Projects
      1. Exercise 3.14: Building the Pipeline for an NLP Project
    6. Saving and Loading Models
      1. Exercise 3.15: Saving and Loading Models
    7. Summary
  6. 4. Collecting Text Data with Web Scraping and APIs
    1. Introduction
    2. Collecting Data by Scraping Web Pages
      1. Exercise 4.01: Extraction of Tag-Based Information from HTML Files
      2. Requesting Content from Web Pages
      3. Exercise 4.02: Collecting Online Text Data
      4. Exercise 4.03: Analyzing the Content of Jupyter Notebooks (in HTML Format)
      5. Activity 4.01: Extracting Information from an Online HTML Page
      6. Activity 4.02: Extracting and Analyzing Data Using Regular Expressions
    3. Dealing with Semi-Structured Data
      1. JSON
      2. Exercise 4.04: Working with JSON Files
      3. XML
      4. Exercise 4.05: Working with an XML File
      5. Using APIs to Retrieve Real-Time Data
      6. Exercise 4.06: Collecting Data Using APIs
      7. Extracting Data from Twitter Using the OAuth API
      8. Activity 4.03: Extracting Data from Twitter
    4. Summary
  7. 5. Topic Modeling
    1. Introduction
    2. Topic Discovery
      1. Exploratory Data Analysis
      2. Transforming Unstructured Data to Structured Data
      3. Bag of Words
    3. Topic-Modeling Algorithms
      1. Latent Semantic Analysis (LSA)
      2. LSA – How It Works
    4. Key Input Parameters for LSA Topic Modeling
      1. Exercise 5.01: Analyzing Wikipedia World Cup Articles with Latent Semantic Analysis
      2. Dirichlet Process and Dirichlet Distribution
      3. Latent Dirichlet Allocation (LDA)
      4. LDA – How It Works
      5. Measuring the Predictive Power of a Generative Topic Model
      6. Exercise 5.02: Finding Topics in Canadian Open Data Inventory Using the LDA Model
      7. Activity 5.01: Topic-Modeling Jeopardy Questions
    5. Hierarchical Dirichlet Process (HDP)
      1. Exercise 5.03: Topics in Around the World in Eighty Days
      2. Exercise 5.04: Topics in The Life and Adventures of Robinson Crusoe by Daniel Defoe
      3. Practical Challenges
      4. State-of-the-Art Topic Modeling
      5. Activity 5.02: Comparing Different Topic Models
    6. Summary
  8. 6. Vector Representation
    1. Introduction
    2. What Is a Vector?
      1. Frequency-Based Embeddings
      2. Exercise 6.01: Word-Level One-Hot Encoding
      3. Character-Level One-Hot Encoding
      4. Exercise 6.02: Character One-Hot Encoding – Manual
      5. Exercise 6.03: Character-Level One-Hot Encoding with Keras
      6. Learned Word Embeddings
      7. Word2Vec
      8. Exercise 6.04: Training Word Vectors
      9. Using Pre-Trained Word Vectors
      10. Exercise 6.05: Using Pre-Trained Word Vectors
      11. Document Vectors
      12. Uses of Document Vectors
      13. Exercise 6.06: Converting News Headlines to Document Vectors
      14. Activity 6.01: Finding Similar News Articles Using Document Vectors
    3. Summary
  9. 7. Text Generation and Summarization
    1. Introduction
    2. Generating Text with Markov Chains
      1. Markov Chains
      2. Exercise 7.01: Text Generation Using a Random Walk over a Markov Chain
    3. Text Summarization
      1. TextRank
    4. Key Input Parameters for TextRank
      1. Exercise 7.02: Performing Summarization Using TextRank
      2. Exercise 7.03: Summarizing a Children's Fairy Tale Using TextRank
      3. Activity 7.01: Summarizing Complaints in the Consumer Financial Protection Bureau Dataset
    5. Recent Developments in Text Generation and Summarization
    6. Practical Challenges in Extractive Summarization
    7. Summary
  10. 8. Sentiment Analysis
    1. Introduction
      1. Why Is Sentiment Analysis Required?
      2. The Growth of Sentiment Analysis
      3. The Monetization of Emotion
      4. Types of Sentiments
        1. Emotion
      5. Key Ideas and Terms
      6. Applications of Sentiment Analysis
    2. Tools Used for Sentiment Analysis
      1. NLP Services from Major Cloud Providers
      2. Online Marketplaces
      3. Python NLP Libraries
      4. Deep Learning Frameworks
    3. The textblob library
      1. Exercise 8.01: Basic Sentiment Analysis Using the textblob Library
      2. Activity 8.01: Tweet Sentiment Analysis Using the textblob Library
    4. Understanding Data for Sentiment Analysis
      1. Exercise 8.02: Loading Data for Sentiment Analysis
    5. Training Sentiment Models
      1. Activity 8.02: Training a Sentiment Model Using TFIDF and Logistic Regression
    6. Summary
  11. Appendix
    1. 1. Introduction to Natural Language Processing
      1. Activity 1.01: Preprocessing of Raw Text
    2. 2. Feature Extraction Methods
      1. Activity 2.01: Extracting Top Keywords from the News Article
      2. Activity 2.02: Text Visualization
    3. 3. Developing a Text Classifier
      1. Activity 3.01: Developing End-to-End Text Classifiers
    4. 4. Collecting Text Data with Web Scraping and APIs
      1. Activity 4.01: Extracting Information from an Online HTML Page
      2. Activity 4.02: Extracting and Analyzing Data Using Regular Expressions
      3. Activity 4.03: Extracting Data from Twitter
    5. 5. Topic Modeling
      1. Activity 5.01: Topic-Modeling Jeopardy Questions
      2. Activity 5.02: Comparing Different Topic Models
    6. 6. Vector Representation
      1. Activity 6.01: Finding Similar News Articles Using Document Vectors
    7. 7. Text Generation and Summarization
      1. Activity 7.01: Summarizing Complaints in the Consumer Financial Protection Bureau Dataset
    8. 8. Sentiment Analysis
      1. Activity 8.01: Tweet Sentiment Analysis Using the textblob Library
      2. Activity 8.02: Training a Sentiment Model Using TFIDF and Logistic Regression

Product information

  • Title: The Natural Language Processing Workshop
  • Author(s): Rohan Chopra, Aniruddha M. Godbole, Nipun Sadvilkar, Muzaffar Bashir Shah, Sohom Ghosh, Dwight Gunning
  • Release date: August 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781800208421