O'Reilly Live Online Training

Leveraging NLP and Word Embeddings in Machine Learning Projects

Intermediate Natural Language Processing

Topic: Data
Maryam Jahanshahi

Manipulating text data is a critical component of any data professional's toolkit. The accessibility of language models has made it much easier to improve the performance of machine learning algorithms based on text data.

In this live training, we will cover how to use word embeddings in supervised machine learning tasks. We will describe key considerations in word representations, with a discussion of different algorithms to generate word vectors. We will also discuss approaches to representing documents as embeddings (including doc2vec vs averaging). Finally, we will discuss how to use embeddings in common machine learning models (as implemented in scikit-learn), as well as issues with the bias-variance trade-off in embeddings.

The focus of this course will be on tools for English language models, although many of the principles can be applied to other languages.

What you'll learn and how you can apply it

  • Decide which language model to use
  • Represent a document via word embeddings
  • Apply a machine learning algorithm to text data

This training course is for you because...

You are a data analyst, data scientist or software engineer who:

  • Has a working understanding of the fundamentals of natural language processing (tokenization, part-of-speech tagging, topic modeling)
  • Wants to be able to use word embeddings in machine learning models

Prerequisites

  • Python 3 proficiency, with some familiarity working in interactive Python environments, including notebooks (Jupyter / Google Colab / Kaggle Kernels)
  • Familiarity with the basics of text preprocessing, including tokenization and stemming/lemmatization
  • Familiarity with basic methods to represent text, including one-hot encoding and term frequencies

Course Set-up:

The Course GitHub Repo contains links to:

  • A hosted notebook instance
  • Instructions on how to set up the environment locally

Recommended Preparation:

Recommended Follow-up:

About your instructor

  • Maryam Jahanshahi is a Research Scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD in Cancer Biology from the Icahn School of Medicine at Mount Sinai. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of NLP, Data Science and Decision Science. She lives in New York, NY.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Introduction to Language Models (30 minutes)

  • Intuition behind vector space modeling (vs one-hot encoding or TF-IDF)
  • Comparisons of different word embedding algorithms and models (e.g. word2vec, GloVe, PPMI, SVD)
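
As a rough illustration of one of the count-based approaches listed above, the following sketch computes positive pointwise mutual information (PPMI) scores from co-occurrence counts. It is not course material: the toy corpus, the window size of 2, and the function names `cooccurrence_counts` and `ppmi` are all assumptions made for this example.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]

def cooccurrence_counts(sentences, window=2):
    """Count symmetric (word, context) pairs within a fixed window."""
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts

def ppmi(pair_counts):
    """Return {(word, context): score}, keeping only positive PMI values."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
    scores = {}
    for (word, context), n in pair_counts.items():
        # PMI = log2( p(word, context) / (p(word) * p(context)) )
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        if pmi > 0:
            scores[(word, context)] = pmi
    return scores

scores = ppmi(cooccurrence_counts(corpus))
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(score, 2))
```

Each word's row of PPMI scores is a sparse vector. Factorizing the full word-by-context matrix, for example with SVD, is one way to turn these rows into dense, low-dimensional embeddings.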

Segment 2: Translating text into numerical representations (40 minutes)

  • Ease of using pretrained embeddings
  • Design considerations in using pretrained models, including noise, sentiments, generalization
  • Break (10 minutes)

Segment 3: Classifying documents using embeddings (70 minutes)

  • Comparing the impact of averaging/summing vs generating document vectors
  • Downstream and performance implications of these decisions
  • Comparing bag of words to embeddings in different ML tasks
  • Break (10 minutes)
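
To make the averaging vs summing comparison in this segment concrete, here is a minimal sketch using plain Python lists. The three-dimensional vectors in `EMBEDDINGS` are hypothetical values invented for the example, and the helper names `sum_vector` and `mean_vector` are assumptions, not part of any library.

```python
# Hypothetical 3-dimensional word vectors, made up for illustration.
# Real pretrained embeddings usually have far more dimensions.
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.1, 0.8, 0.2],
    "boring": [-0.8, 0.2, 0.1],
}

def sum_vector(tokens, embeddings=EMBEDDINGS):
    """Sum the vectors of in-vocabulary tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    return [sum(dims) for dims in zip(*known)]

def mean_vector(tokens, embeddings=EMBEDDINGS):
    """Average the vectors of in-vocabulary tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    return [sum(dims) / len(known) for dims in zip(*known)]

short_doc = "great movie".split()
long_doc = short_doc * 3  # same content, three times as long

print("mean:", mean_vector(short_doc), mean_vector(long_doc))
print("sum: ", sum_vector(short_doc), sum_vector(long_doc))
```

In this sketch the averaged vector stays the same when the document is repeated, while the summed vector grows with document length, which is one reason the choice can matter downstream. Stacking one such vector per document gives a numeric feature matrix of the kind scikit-learn estimators accept in `fit(X, y)`.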

Segment 4: Issues in using pretrained embeddings in ML tasks (30 minutes)

  • Dominant word sense and context
  • Measuring and testing bias
  • Implications of poor embedding fit on ML models
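
One simple way to probe bias in a set of embeddings is to compare how strongly a target word associates with two groups of attribute words, using cosine similarity. The sketch below is in the spirit of embedding association tests and is not necessarily the method used in the course. The two-dimensional vectors in `VECTORS` are hypothetical values chosen to show the effect, and `association` is an illustrative helper name.

```python
import math

# Hypothetical 2-dimensional vectors, invented so the effect is visible.
VECTORS = {
    "he": [1.0, 0.1],
    "she": [0.1, 1.0],
    "engineer": [0.8, 0.3],
    "nurse": [0.3, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(word, group_a, group_b, vectors=VECTORS):
    """Mean similarity to group_a minus mean similarity to group_b.

    Positive values mean the word sits closer to group_a.
    """
    w = vectors[word]
    mean_a = sum(cosine(w, vectors[a]) for a in group_a) / len(group_a)
    mean_b = sum(cosine(w, vectors[b]) for b in group_b) / len(group_b)
    return mean_a - mean_b

for word in ("engineer", "nurse"):
    print(word, round(association(word, ["he"], ["she"]), 3))
```

With real pretrained embeddings, the attribute groups would normally contain several words each, and a score near zero would suggest no strong association in either direction.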

Q&A (10 minutes)