Machine Learning Techniques for Text

Book description

Take your Python text processing skills to another level by learning about the latest natural language processing and machine learning techniques with this full-color guide.

Key Features

  • Learn how to acquire and process textual data and visualize the key findings
  • Obtain deeper insight into the most commonly used algorithms and techniques and understand their trade-offs
  • Implement models for solving real-world problems and evaluate their performance

Book Description

With the ever-increasing demand for machine learning and programming professionals, it's prime time to invest in the field. This book will help you in this endeavor, focusing specifically on text data and human language. It steers a middle path between textbooks that present complicated theoretical concepts and those that focus disproportionately on Python code.

A good metaphor this work builds upon is the relationship between an experienced craftsperson and their trainee. Based on the current problem, the former picks a tool from the toolbox, explains its utility, and puts it into action. This approach will help you to identify at least one practical use for each method or technique presented. The content unfolds in ten chapters, each discussing one specific case study. For this reason, the book is solution-oriented. It's accompanied by Python code in the form of Jupyter notebooks to help you obtain hands-on experience. A recurring pattern in the chapters of this book is helping you build intuition about the data and then implementing and contrasting various solutions.

By the end of this book, you'll be able to understand and apply various techniques with Python for text preprocessing, text representation, dimensionality reduction, machine learning, language modeling, visualization, and evaluation.

What you will learn

  • Understand fundamental concepts of machine learning for text
  • Discover how text data can be represented and build language models
  • Perform exploratory data analysis on text corpora
  • Use text preprocessing techniques and understand their trade-offs
  • Apply dimensionality reduction for visualization and classification
  • Incorporate and fine-tune algorithms and models for machine learning
  • Evaluate the performance of the implemented systems
  • Know the tools for retrieving text data and visualizing the machine learning workflow

Who this book is for

This book is for professionals in the area of computer science, programming, data science, informatics, business analytics, statistics, language technology, and more who aim for a gentle career shift into machine learning for text. Students in relevant disciplines who seek a textbook in the field will benefit from the practical aspects of the content and how the theory is presented. Finally, professors teaching a similar course will be able to pick pertinent topics in terms of content and difficulty. Beginner-level knowledge of Python programming is needed to get started with this book.

Table of contents

  1. Machine Learning Techniques for Text
  2. Acknowledgments
  3. Contributors
  4. About the author
  5. About the reviewers
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  7. Chapter 1: Introducing Machine Learning for Text
    1. The language phenomenon
    2. The data explosion
    3. The era of AI
    4. Relevant research fields
    5. The machine learning paradigm
    6. Taxonomy of machine learning techniques
      1. Supervised learning
      2. Unsupervised learning
      3. Semi-supervised learning
      4. Reinforcement learning
    7. Visualization of the data
    8. Evaluation of the results
    9. Summary
  8. Chapter 2: Detecting Spam Emails
    1. Technical requirements
    2. Understanding spam detection
      1. Explaining feature engineering
    3. Extracting word representations
      1. Using label encoding
      2. Using one-hot encoding
      3. Using token count encoding
      4. Using tf-idf encoding
    4. Executing data preprocessing
      1. Tokenizing the input
      2. Removing stop words
      3. Stemming the words
      4. Lemmatizing the words
    5. Performing classification
      1. Getting the data
      2. Creating the train and test sets
      3. Preprocessing the data
      4. Extracting the features
      5. Introducing the Support Vector Machines algorithm
      6. Understanding Bayes’ theorem
    6. Measuring classification performance
      1. Calculating accuracy
      2. Calculating precision and recall
      3. Calculating the F-score
      4. Creating ROC and AUC
      5. Creating precision-recall curves
    7. Summary
  9. Chapter 3: Classifying Topics of Newsgroup Posts
    1. Technical requirements
    2. Understanding topic classification
    3. Performing exploratory data analysis
    4. Executing dimensionality reduction
      1. Understanding principal component analysis
      2. Understanding linear discriminant analysis
      3. Putting PCA and LDA into action
    5. Introducing the k-nearest neighbors algorithm
      1. Performing feature extraction
      2. Performing cross-validation
      3. Performing classification
      4. Comparison to the baseline model
    6. Introducing the random forest algorithm
      1. Constructing a decision tree
      2. Performing classification
    7. Extracting word embedding representation
      1. Understanding word embedding
      2. Performing vector arithmetic
      3. Performing classification
      4. Using the fastText tool
    8. Summary
  10. Chapter 4: Extracting Sentiments from Product Reviews
    1. Technical requirements
    2. Understanding sentiment analysis
    3. Performing exploratory data analysis
      1. Using the Software dataset
      2. Exploiting the ratings of products
      3. Extracting the word count of reviews
      4. Exploiting the helpfulness score
    4. Introducing linear regression
      1. Putting linear regression into action
    5. Introducing logistic regression
      1. Understanding gradient descent
      2. Using logistic regression
      3. Creating training and test sets
      4. Performing classification
      5. Applying regularization
    6. Introducing deep neural networks
      1. Understanding logic gates
      2. Understanding perceptrons
      3. Understanding artificial neurons
      4. Creating artificial neural networks
      5. Training artificial neural networks
      6. Performing classification
    7. Summary
  11. Chapter 5: Recommending Music Titles
    1. Technical requirements
    2. Understanding recommender systems
    3. Performing exploratory data analysis
      1. Cleaning the data
      2. Extracting information from the data
      3. Understanding the Pearson correlation
    4. Introducing content-based filtering
      1. Extracting music recommendations
    5. Introducing collaborative filtering
      1. Using memory-based collaborative recommenders
      2. Applying SVD
      3. Clustering handwritten text
      4. Applying t-SNE
      5. Using model-based collaborative systems
      6. Introducing autoencoders
    6. Summary
  12. Chapter 6: Teaching Machines to Translate
    1. Technical requirements
    2. Understanding machine translation
    3. Introducing rule-based machine translation
      1. Using direct machine translation
      2. Using transfer-based machine translation
      3. Using interlingual machine translation
    4. Introducing example-based machine translation
    5. Introducing statistical machine translation
      1. Modeling the translation problem
      2. Creating the models
    6. Introducing sequence-to-sequence learning
      1. Deciphering the encoder/decoder architecture
      2. Understanding long short-term memory units
      3. Putting seq2seq in action
    7. Measuring translation performance
    8. Summary
  13. Chapter 7: Summarizing Wikipedia Articles
    1. Technical requirements
    2. Understanding text summarization
    3. Introducing web scraping
      1. Scraping popular quotes
      2. Scraping book reviews
      3. Scraping Wikipedia articles
    4. Performing extractive summarization
    5. Performing abstractive summarization
      1. Introducing the attention mechanism
      2. Introducing transformers
      3. Putting the transformer into action
    6. Measuring summarization performance
    7. Summary
  14. Chapter 8: Detecting Hateful and Offensive Language
    1. Technical requirements
    2. Introducing social networks
    3. Understanding BERT
      1. Pre-training phase
      2. Fine-tuning phase
      3. Putting BERT into action
    4. Introducing boosting algorithms
      1. Understanding AdaBoost
      2. Understanding gradient boosting
      3. Understanding XGBoost
    5. Creating validation sets
      1. Learning the myth of Icarus
      2. Extracting the datasets
    6. Treating imbalanced datasets
    7. Classifying with BERT
      1. Training the classifier
      2. Applying early stopping
    8. Understanding CNN
      1. Adding pooling layers
      2. Including CNN layers
    9. Summary
  15. Chapter 9: Generating Text in Chatbots
    1. Technical requirements
    2. Understanding text generation
    3. Creating a retrieval-based chatbot
    4. Understanding language modeling
      1. Understanding perplexity
      2. Building a language model
    5. Creating a generative chatbot
      1. Using a pre-trained model
      2. Creating the GUI
      3. Creating the web chatbot
      4. Fine-tuning a pre-trained model
    6. Summary
  16. Chapter 10: Clustering Speech-to-Text Transcriptions
    1. Technical requirements
    2. Understanding text clustering
    3. Preprocessing the data
    4. Using speech-to-text
    5. Introducing the K-means algorithm
      1. Putting K-means into action
    6. Introducing DBSCAN
      1. Putting DBSCAN into action
      2. Assessing DBSCAN
    7. Introducing the hierarchical clustering algorithm
      1. Putting hierarchical clustering into action
    8. Introducing the LDA algorithm
      1. Putting LDA into action
    9. Summary
  17. Index
    1. Why subscribe?
  18. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Machine Learning Techniques for Text
  • Author(s): Nikos Tsourakis
  • Release date: October 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781803242385