Natural Language Processing Fundamentals

Book description

Use Python and NLTK (Natural Language Toolkit) to build out your own text classifiers and solve common NLP problems.

Key Features

  • Assimilate key NLP concepts and terminology
  • Explore popular NLP tools and techniques
  • Gain practical experience using NLP in application code

Book Description

If NLP hasn't been your forte, Natural Language Processing Fundamentals will help you get off to a steady start. This comprehensive guide shows you how to use Python libraries and NLP concepts effectively to solve a range of problems.

You'll be introduced to natural language processing and its applications through examples and exercises. This is followed by an introduction to the initial stages of solving a problem, which include defining the problem, getting text data, and preparing it for modeling. With exposure to concepts like advanced natural language processing algorithms and visualization techniques, you'll learn how to create applications that can extract information from unstructured data and present it as impactful visuals. Although you will continue to learn NLP-based techniques, the focus will gradually shift to developing useful applications. In these sections, you'll learn how to apply NLP techniques to answer questions, as is done in chatbots.

By the end of this book, you'll be able to accomplish a range of tasks, from identifying the most suitable type of NLP task for solving a problem to using a tool like spaCy or Gensim for performing sentiment analysis. The book will equip you with the knowledge you need to build applications that interpret human language.

What you will learn

  • Obtain, verify, and clean data before transforming it into a correct format for use
  • Perform data analysis and machine learning tasks using Python
  • Understand the basics of computational linguistics
  • Build models for general natural language processing tasks
  • Evaluate the performance of a model with the right metrics
  • Visualize, quantify, and perform exploratory analysis on any text data

Who this book is for

Natural Language Processing Fundamentals is designed for novice and mid-level data scientists and machine learning developers who want to gather and analyze text data to build an NLP-powered product. Prior experience of coding in Python, including using data types, writing functions, and importing libraries, will help. Some experience with linguistics and probability is useful but not necessary.

Table of contents

  1. Preface
    1. About the Book
      1. About the Authors
      2. Learning Objectives
      3. Audience
      4. Approach
      5. Hardware Requirements
      6. Software Requirements
      7. Conventions
      8. Installation and Setup
      9. Working with the Jupyter Notebook
      10. Importing Python Libraries
      11. Installing the Code Bundle
      12. Additional Resources
  2. 1. Introduction to Natural Language Processing
    1. Introduction
    2. History of NLP
    3. Text Analytics and NLP
      1. Exercise 1: Basic Text Analytics
    4. Various Steps in NLP
      1. Tokenization
      2. Exercise 2: Tokenization of a Simple Sentence
      3. PoS Tagging
      4. Exercise 3: PoS Tagging
      5. Stop Word Removal
      6. Exercise 4: Stop Word Removal
      7. Text Normalization
      8. Exercise 5: Text Normalization
      9. Spelling Correction
      10. Exercise 6: Spelling Correction of a Word and a Sentence
      11. Stemming
      12. Exercise 7: Stemming
      13. Lemmatization
      14. Exercise 8: Extracting the base word using Lemmatization
      15. NER
      16. Exercise 9: Treating Named Entities
      17. Word Sense Disambiguation
      18. Exercise 10: Word Sense Disambiguation
      19. Sentence Boundary Detection
      20. Exercise 11: Sentence Boundary Detection
      21. Activity 1: Preprocessing of Raw Text
    5. Kick Starting an NLP Project
      1. Data Collection
      2. Data Preprocessing
      3. Feature Extraction
      4. Model Development
      5. Model Assessment
      6. Model Deployment
    6. Summary
  3. 2. Basic Feature Extraction Methods
    1. Introduction
    2. Types of Data
      1. Categorizing Data Based on Structure
      2. Categorization of Data Based on Content
    3. Cleaning Text Data
      1. Tokenization
      2. Exercise 12: Text Cleaning and Tokenization
      3. Exercise 13: Extracting n-grams
      4. Exercise 14: Tokenizing Texts with Different Packages – Keras and TextBlob
      5. Types of Tokenizers
      6. Exercise 15: Tokenizing Text Using Various Tokenizers
      7. Issues with Tokenization
      8. Stemming
      9. RegexpStemmer
      10. Exercise 16: Converting words in gerund form into base words using RegexpStemmer
      11. The Porter Stemmer
      12. Exercise 17: The Porter Stemmer
      13. Lemmatization
      14. Exercise 18: Lemmatization
      15. Exercise 19: Singularizing and Pluralizing Words
      16. Language Translation
      17. Exercise 20: Language Translation
      18. Stop-Word Removal
      19. Exercise 21: Stop-Word Removal
    4. Feature Extraction from Texts
      1. Extracting General Features from Raw Text
      2. Exercise 22: Extracting General Features from Raw Text
      3. Activity 2: Extracting General Features from Text
      4. Bag of Words
      5. Exercise 23: Creating a BoW
      6. Zipf's Law
      7. Exercise 24: Zipf's Law
      8. TF-IDF
      9. Exercise 25: TF-IDF Representation
      10. Activity 3: Extracting Specific Features from Texts
    5. Feature Engineering
      1. Exercise 26: Feature Engineering (Text Similarity)
      2. Word Clouds
      3. Exercise 27: Word Clouds
      4. Other Visualizations
      5. Exercise 28: Other Visualizations (Dependency Parse Trees and Named Entities)
      6. Activity 4: Text Visualization
    6. Summary
  4. 3. Developing a Text Classifier
    1. Introduction
    2. Machine Learning
      1. Unsupervised Learning
      2. Hierarchical Clustering
      3. Exercise 29: Hierarchical Clustering
      4. K-Means Clustering
      5. Exercise 30: K-Means Clustering
      6. Supervised Learning
      7. Classification
      8. Logistic Regression
      9. Naive Bayes Classifiers
      10. K-Nearest Neighbors
      11. Exercise 31: Text Classification (Logistic regression, Naive Bayes, and KNN)
      12. Regression
      13. Linear Regression
      14. Exercise 32: Regression Analysis Using Textual Data
      15. Tree Methods
      16. Random Forest
      17. GBM and XGBoost
      18. Exercise 33: Tree-Based Methods (Decision Tree, Random Forest, GBM, and XGBoost)
      19. Sampling
      20. Exercise 34: Sampling (Simple Random, Stratified, Multi-Stage)
    3. Developing a Text Classifier
      1. Feature Extraction
      2. Feature Engineering
      3. Removing Correlated Features
      4. Exercise 35: Removing Highly Correlated Features (Tokens)
      5. Dimensionality Reduction
      6. Exercise 36: Dimensionality Reduction (PCA)
      7. Deciding on a Model Type
      8. Evaluating the Performance of a Model
      9. Exercise 37: Calculate the RMSE and MAPE
      10. Activity 5: Developing End-to-End Text Classifiers
    4. Building Pipelines for NLP Projects
      1. Exercise 38: Building Pipelines for NLP Projects
    5. Saving and Loading Models
      1. Exercise 39: Saving and Loading Models
    6. Summary
  5. 4. Collecting Text Data from the Web
    1. Introduction
    2. Collecting Data by Scraping Web Pages
      1. Exercise 40: Extraction of Tag-Based Information from HTML Files
    3. Requesting Content from Web Pages
      1. Exercise 41: Collecting Online Text Data
      2. Exercise 42: Analyzing the Content of Jupyter Notebooks (in HTML Format)
      3. Activity 6: Extracting Information from an Online HTML Page
      4. Activity 7: Extracting and Analyzing Data Using Regular Expressions
    4. Dealing with Semi-Structured Data
      1. JSON
      2. Exercise 43: Dealing with JSON Files
      3. Activity 8: Dealing with Online JSON Files
      4. XML
      5. Exercise 44: Dealing with a Local XML File
      6. Using APIs to Retrieve Real-Time Data
      7. Exercise 45: Collecting Data Using APIs
      8. API Creation
      9. Activity 9: Extracting Data from Twitter
      10. Extracting Data from Local Files
      11. Exercise 46: Extracting Data from Local Files
      12. Exercise 47: Performing Various Operations on Local Files
    5. Summary
  6. 5. Topic Modeling
    1. Introduction
    2. Topic Discovery
      1. Discovering Themes
      2. Exploratory Data Analysis
      3. Document Clustering
      4. Dimensionality Reduction
      5. Historical Analysis
      6. Bag of Words
    3. Topic Modeling Algorithms
      1. Latent Semantic Analysis
      2. LSA – How It Works
      3. Exercise 48: Analyzing Reuters News Articles with Latent Semantic Analysis
      4. Latent Dirichlet Allocation
      5. LDA – How It Works
      6. Exercise 49: Topics in Airline Tweets
      7. Topic Fingerprinting
      8. Exercise 50: Visualizing Documents Using Topic Vectors
      9. Activity 10: Topic Modelling Jeopardy Questions
    4. Summary
  7. 6. Text Summarization and Text Generation
    1. Introduction
    2. What is Automated Text Summarization?
      1. Benefits of Automated Text Summarization
    3. High-Level View of Text Summarization
      1. Purpose
      2. Input
      3. Output
      4. Extractive Text Summarization
      5. Abstractive Text Summarization
      6. Sequence to Sequence
      7. Encoder Decoder
    4. TextRank
      1. Exercise 51: TextRank from Scratch
    5. Summarizing Text Using Gensim
      1. Activity 11: Summarizing a Downloaded Page Using the Gensim Text Summarizer
    6. Summarizing Text Using Word Frequency
      1. Exercise 52: Word Frequency Text Summarization
    7. Generating Text with Markov Chains
      1. Markov Chains
      2. Exercise 53: Generating Text Using Markov Chains
    8. Summary
  8. 7. Vector Representation
    1. Introduction
    2. Vector Definition
    3. Why Vector Representations?
      1. Encoding
      2. Character-Level Encoding
      3. Exercise 54: Character Encoding Using ASCII Values
      4. Exercise 55: Character Encoding with the Help of NumPy Arrays
      5. Positional Character-Level Encoding
      6. Exercise 56: Character-Level Encoding Using Positions
      7. One-Hot Encoding
      8. Key Steps in One-Hot Encoding
      9. Exercise 57: Character One-Hot Encoding – Manual
      10. Exercise 58: Character-Level One-Hot Encoding with Keras
      11. Word-Level One-Hot Encoding
      12. Exercise 59: Word-Level One-Hot Encoding
      13. Word Embeddings
      14. Word2Vec
      15. Exercise 60: Training Word Vectors
      16. Using Pre-Trained Word Vectors
      17. Exercise 61: Loading Pre-Trained Word Vectors
      18. Document Vectors
      19. Uses of Document Vectors
      20. Exercise 62: From Movie Dialogue to Document Vectors
      21. Activity 12: Finding Similar Movie Lines Using Document Vectors
    4. Summary
  9. 8. Sentiment Analysis
    1. Introduction
    2. Why is Sentiment Analysis Required?
    3. Growth of Sentiment Analysis
      1. Monetization of Emotion
      2. Types of Sentiments
      3. Key Ideas and Terms
      4. Applications of Sentiment Analysis
    4. Tools Used for Sentiment Analysis
      1. NLP Services from Major Cloud Providers
      2. Online Marketplaces
      3. Python NLP Libraries
      4. Deep Learning Libraries
    5. TextBlob
      1. Exercise 63: Basic Sentiment Analysis Using the TextBlob Library
      2. Activity 13: Tweet Sentiment Analysis Using the TextBlob library
    6. Understanding Data for Sentiment Analysis
      1. Exercise 64: Loading Data for Sentiment Analysis
    7. Training Sentiment Models
      1. Exercise 65: Training a Sentiment Model Using TFIDF and Logistic Regression
    8. Summary
  10. Appendix
    1. 1. Introduction to Natural Language Processing
      1. Activity 1: Preprocessing of Raw Text
    2. 2. Basic Feature Extraction Methods
      1. Activity 2: Extracting General Features from Text
      2. Activity 3: Extracting Specific Features from Texts
      3. Activity 4: Text Visualization
    3. 3. Developing a Text Classifier
      1. Activity 5: Developing End-to-End Text Classifiers
    4. 4. Collecting Text Data from the Web
      1. Activity 6: Extracting Information from an Online HTML Page
      2. Activity 7: Extracting and Analyzing Data Using Regular Expressions
      3. Activity 8: Dealing with Online JSON Files
      4. Activity 9: Extracting Data from Twitter
    5. 5. Topic Modeling
      1. Activity 10: Topic Modelling Jeopardy Questions
    6. 6. Text Summarization and Text Generation
      1. Activity 11: Summarizing a Downloaded Page Using the Gensim Text Summarizer
    7. 7. Vector Representation
      1. Activity 12: Finding Similar Movie Lines Using Document Vectors
      2. Solution
    8. 8. Sentiment Analysis
      1. Activity 13: Tweet Sentiment Analysis Using the TextBlob library

Product information

  • Title: Natural Language Processing Fundamentals
  • Author(s): Sohom Ghosh, Dwight Gunning
  • Release date: March 2019
  • Publisher(s): Packt Publishing
  • ISBN: 9781789954043