Extracting Insights from Text Data using NLP and Word Embeddings
Intermediate Natural Language Processing
Free-form natural language data makes up the vast majority of data both on our computer systems and on the internet at large. Yet, compared to structured data, it remains woefully under-analyzed. Over the last few years, significant developments in natural language processing have enabled us to use the context of words to build powerful language models.
In this 3-hour live training, we will cover how to build advanced natural language models. We will discuss the considerations that go into deciding whether to build or "borrow" a language model. Our focus in this class will be on the practical implementation of custom word embeddings, using the spaCy library for preprocessing and the gensim library to train a model. We will also discuss generating test suites to monitor language models for accuracy as well as bias.
The focus of this course will be on tools for English language models, although many of the principles can be applied to other languages.
What you'll learn and how you can apply it
- Preprocess a text corpus effectively for generating a language model
- Train and tune a custom word embedding model
- Develop test suites to monitor accuracy and bias in a language model
This training course is for you because...
You are a data scientist, software engineer, or ML engineer who:
- has a working understanding of the fundamentals of natural language processing (tokenization, part-of-speech tagging, topic modeling)
- wants to use embeddings to generate trends and insights about natural language data
- wants to be able to transform a corpus of natural language into vector space representations that can be used as inputs for machine learning models
Prerequisites
- Proficiency in Python 3, with some familiarity with interactive Python environments such as notebooks (Jupyter, Google Colab, Kaggle Kernels)
- Familiarity with the basics of text preprocessing, including tokenization and stemming/lemmatization
- Familiarity with basic methods of representing text, including one-hot encoding and term frequencies.
Course set-up
- The Course GitHub Repo contains links to:
- A hosted notebook instance, and
- Instructions on how to set up these environments locally.
Recommended resources
- Video: Python Programming by David Beazley https://learning.oreilly.com/videos/python-programming-language/9780134217314
- Video: Modern Python: Big Ideas and Little Code in Python by Raymond Hettinger https://learning.oreilly.com/videos/modern-python-livelessons/9780134743400
- Live Online Training: Building Text Analytics Pipelines using NLP (with Python and spaCy): Intermediate Natural Language Processing by Maryam Jahanshahi on the O'Reilly Learning Platform
- Live Online Training: Leveraging NLP and Word Embeddings in Machine Learning Projects: Intermediate Natural Language Processing by Maryam Jahanshahi on the O'Reilly Learning Platform
- Book: Natural Language Processing with Python and spaCy by Yuli Vasiliev https://learning.oreilly.com/library/view/natural-language-processing/9781098122652/
- Video: Deep learning for NLP using Python by Tyler Edwards https://learning.oreilly.com/videos/deep-learning-for/9781788621700
- Video: Deep Learning for Natural Language Processing by Jon Krohn https://learning.oreilly.com/videos/deep-learning-for/9780136620013
About your instructor
Maryam Jahanshahi is a Research Scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD in Cancer Biology from the Icahn School of Medicine at Mount Sinai. Maryam's long-term research goal is to reduce bias in decision making by using a combination of NLP, data science, and decision science. She lives in New York, NY.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Segment 1: Considerations in Language Model Choice (20 minutes)
- Comparing pretrained vs custom embedding models
- Comparison of different word embedding algorithms (e.g., word2vec, GloVe, PPMI, SVD)
- Biases of embeddings and other language models
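To make the bias point concrete, here is a minimal, illustrative probe using analogy arithmetic over hand-crafted toy vectors. A real audit would use trained embeddings and curated word lists (e.g., WEAT-style association tests):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors crafted so that a gendered association is visible;
# real audits probe trained embeddings, not hand-picked values.
vecs = {
    "man":    [1.0, 0.0],
    "woman":  [0.0, 1.0],
    "doctor": [1.0, 0.2],
    "nurse":  [0.1, 1.0],
}

# Analogy arithmetic: doctor - man + woman
query = [d - m + w for d, m, w in zip(vecs["doctor"], vecs["man"], vecs["woman"])]
nearest = max(("doctor", "nurse"), key=lambda w: cosine(query, vecs[w]))
# If `nearest` is "nurse", the embedding encodes a gendered association.
```

The same arithmetic that powers the famous "king - man + woman ≈ queen" demo is what surfaces unwanted associations, which is why bias testing belongs in the same toolbox as accuracy testing.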
Segment 2: Developing Custom Embeddings (30 minutes)
- Implications of preprocessing
- Optimizing for different outputs (semantic relations vs semantic similarity)
Break (10 minutes)
Segment 3: Testing custom embeddings (50 minutes)
- Approaches to testing embeddings based on similarity vs relation
- Generating test suites and scoring methods
- Algorithms to visualize embeddings (t-SNE and UMAP)
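A similarity-based test case from such a suite can be sketched in a few lines. The vectors below are illustrative toys rather than the output of a trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (illustrative values, not from a trained model).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

# A similarity-based test case: related words should score higher
# than unrelated ones.
related = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])
assert related > unrelated
```

Collecting many such word-pair assertions into a scored suite lets you re-run the same checks every time the model is retrained.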
Break (10 minutes)
Segment 4: Application of custom embeddings (50 minutes)
- Using custom word embeddings in ML algorithms
- Using word embeddings to generate insights from corpora
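One common baseline for feeding embeddings into downstream ML algorithms is averaging a document's word vectors into a fixed-length feature. A minimal sketch, using toy vectors and a hypothetical helper name:

```python
def document_vector(tokens, embeddings):
    """Average the word vectors of in-vocabulary tokens -- a common
    baseline for turning a document into a fixed-length feature."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known tokens; caller decides how to handle
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional embeddings (illustrative values only).
embeddings = {"good": [0.9, 0.1, 0.0], "movie": [0.2, 0.8, 0.1]}

# Out-of-vocabulary tokens ("oov") are simply skipped.
features = document_vector(["good", "movie", "oov"], embeddings)
# `features` is a fixed-length vector usable as input to any classifier.
```

More sophisticated pooling schemes exist, but the averaged vector is a strong, cheap starting point for the kinds of ML applications this segment covers.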
Q&A (10 minutes)