O'Reilly Live Online Training

Extracting Insights from Text Data using NLP and Word Embeddings

Intermediate Natural Language Processing

Topic: Data
Maryam Jahanshahi

Free-form natural language data makes up the vast majority of data both on our computer systems and on the internet at large. Yet, in comparison to structured data, it remains woefully under-analyzed. Over the last few years, significant developments in natural language processing have enabled us to use the context of words to create powerful language models.

In this 3-hour live training, we will cover how to generate advanced natural language models. We will cover the considerations in deciding whether to build or “borrow” a language model. Our focus in this class will be on the practical implementation of custom word embeddings, using the spaCy library for preprocessing and the gensim library to train a model. We will also discuss generating test suites to monitor language models for accuracy as well as bias.

The focus of this course will be on tools for English language models, although many of the principles can be applied to other languages.

What you'll learn-and how you can apply it

  • Preprocess a text corpus effectively for generating a language model
  • Train and tune a custom word embedding model
  • Develop test suites to monitor accuracy and bias in a language model

This training course is for you because...

You are a data scientist, software engineer, or ML engineer who:

  • has a working understanding of the fundamentals of natural language processing (tokenization, part-of-speech tagging, topic modeling)
  • wants to use embeddings to generate trends and insights about natural language data
  • wants to be able to transform a corpus of natural language into vector space representations that can be used as inputs for machine learning models


Prerequisites

  • Python 3 proficiency, with some familiarity with interactive Python environments, including notebooks (Jupyter / Google Colab / Kaggle Kernels)
  • Familiarity with the basics of text preprocessing, including tokenization and stemming/lemmatization
  • Familiarity with basic methods of representing text, including one-hot encoding and term frequencies

Course Set-up

  • The Course GitHub Repo contains links to:
  • A hosted notebook instance, and
  • Instructions on how to set up these environments locally.


About your instructor

Maryam Jahanshahi is a Research Scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD in Cancer Biology from the Icahn School of Medicine at Mount Sinai. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of NLP, Data Science, and Decision Science. She lives in New York, NY.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Considerations in Language Model Choice (20 minutes)

  • Comparing pretrained vs custom embedding models
  • Comparisons of different word embedding algorithms (e.g., word2vec, GloVe, PPMI, SVD)
  • Biases of embeddings and other language models
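Bias in embeddings is often quantified with a differential-association score in the style of WEAT: compare how strongly a target word's vector associates with two sets of attribute words. A minimal sketch of the scoring math, using random toy vectors as stand-ins (a real audit would pull vectors from a trained model):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(target, attrs_a, attrs_b):
    # Positive: target leans toward attribute set A; negative: toward set B.
    return (np.mean([cosine(target, v) for v in attrs_a])
            - np.mean([cosine(target, v) for v in attrs_b]))

rng = np.random.default_rng(0)
dim = 50
# Toy stand-ins for word vectors (e.g. an occupation word vs. two attribute sets).
occupation = rng.normal(size=dim)
attrs_a = [rng.normal(size=dim) for _ in range(5)]
attrs_b = [rng.normal(size=dim) for _ in range(5)]

score = association(occupation, attrs_a, attrs_b)
print(round(score, 4))
```

Since each cosine lies in [-1, 1], the score is bounded by [-2, 2]; values far from zero flag an association worth investigating.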

Segment 2: Developing Custom Embeddings (30 minutes)

  • Implications of preprocessing
  • Optimizing for different outputs (semantic relations vs. semantic similarity)

Break (10 minutes)

Segment 3: Testing Custom Embeddings (50 minutes)

  • Approaches to test embeddings based on similarity vs relation
  • Generating test suites and scoring methods
  • Algorithms to visualize embeddings (t-SNE and UMAP)
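A test suite for embeddings can be as simple as a list of word pairs that should, or should not, score above a similarity threshold. A minimal scorer of that kind, with illustrative toy vectors standing in for a trained model's:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_suite(vectors, test_cases, threshold=0.5):
    """Return the fraction of cases where the observed similarity
    matches the expected relationship (similar vs. not similar)."""
    passed = 0
    for w1, w2, expect_similar in test_cases:
        sim = cosine(vectors[w1], vectors[w2])
        if (sim >= threshold) == expect_similar:
            passed += 1
    return passed / len(test_cases)

# Toy vectors: "doctor" and "physician" nearly parallel, "banana" far off.
vectors = {
    "doctor":    np.array([1.0, 0.9, 0.1]),
    "physician": np.array([0.9, 1.0, 0.2]),
    "banana":    np.array([-0.1, 0.2, 1.0]),
}

suite = [
    ("doctor", "physician", True),   # should be similar
    ("doctor", "banana", False),     # should not be similar
]

print(score_suite(vectors, suite))  # 1.0
```

Tracking this pass rate over retraining runs is one way to catch regressions in either accuracy or bias before a model ships.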

Break (10 minutes)

Segment 4: Application of Custom Embeddings (50 minutes)

  • Using custom word embeddings in ML algorithms
  • Using word embeddings to generate insights from corpora
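A common first application is to turn each document into a fixed-length feature vector by averaging its word embeddings, which can then feed any standard ML model. A sketch with toy vectors (a real pipeline would look these up in the trained model, e.g. gensim's model.wv):

```python
import numpy as np

# Toy embedding lookup; in practice this comes from a trained model.
embeddings = {
    "great": np.array([0.8, 0.1]),
    "movie": np.array([0.2, 0.6]),
    "terrible": np.array([-0.7, 0.0]),
}

def doc_vector(tokens, embeddings):
    """Average the embeddings of known tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(known, axis=0)

features = doc_vector(["great", "movie"], embeddings)
print(features)  # approximately [0.5, 0.35]
```

This averaged representation discards word order, which is often acceptable for trend analysis over a corpus; we will discuss when a more structured representation is warranted.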

Q&A (10 minutes)