Live Online Training

Building Text Analytics Pipelines using NLP

Intermediate Natural Language Processing

Topic: Data
Maryam Jahanshahi

From transaction descriptions to product reviews and social media interactions, there is a treasure trove of business insights hiding in text data. Despite the potential for significant upside, many analysts do not parse natural language data because it lacks structure, requires time-consuming processing steps, and depends on a dizzying array of specialized language models and libraries.

In this 3-hour live training, we will cover hands-on techniques for scoping and implementing text analytics projects. We will use statistical (rather than machine-learning) methods that enable robust and reproducible analysis of text data, working with the spaCy library in Python. We will cover concepts such as optimizing text processing pipelines, using linguistic structure to extract important features, and applying clustering algorithms to identify key trends in text data.

The course focuses on tools and models for English, although many of the principles can be applied to other languages.

What you'll learn and how you can apply it

  • Process text data efficiently
  • Identify and extract important features from text using its linguistic structure
  • Use the distribution and structure of documents to identify key trends in text data

This training course is for you because...

  • You have an analytics background (e.g., you are a data analyst, BI analyst, or business analyst) and need to analyze text data
  • You have a background in Python (e.g., you are a software engineer) and want to learn how to process text data


Prerequisites

  • Python 3 proficiency, plus some familiarity with interactive Python environments such as notebooks (Jupyter, Google Colab, or Kaggle Kernels)

Recommended Preparation

Recommended Follow-up

  • Live Online Training: Leveraging NLP and Word Embeddings in Machine Learning Projects: Intermediate Natural Language Processing by Maryam Jahanshahi on the O'Reilly Learning Platform
  • Live Online Training: Extracting Insights from Text Data using NLP and Word Embeddings: Intermediate Natural Language Processing by Maryam Jahanshahi on the O'Reilly Learning Platform
  • Book: Natural Language Processing with Python and spaCy by Yuli Vasiliev https://learning.oreilly.com/library/view/natural-language-processing/9781098122652/

About your instructor

  • Maryam Jahanshahi is a Research Scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD in Cancer Biology from the Icahn School of Medicine at Mount Sinai. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of NLP, Data Science and Decision Science. She lives in New York, NY.


The timeframes are only estimates and may vary depending on how the class is progressing.

Segment 1: Design considerations for text analysis projects (20 minutes)

  • Unique challenges of text data analysis
  • Considerations in data clean up and preprocessing

Segment 2: Efficiently processing text data for analysis (30 minutes)

  • Tokenization using spaCy
  • Phrase generation using Gensim
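
To make the phrase generation step concrete, here is a minimal, library-free sketch of count-based collocation scoring: adjacent word pairs that co-occur more often than their individual frequencies suggest are merged into a single token. The scoring formula is assumed to mirror the default scorer behind Gensim's Phrases model (after Mikolov et al.); the function names, the toy corpus, and the threshold are hypothetical and are not Gensim's API.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs; higher scores suggest a phrase."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs seen min_count times or fewer score <= 0 and are dropped.
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > 0:
            scores[(a, b)] = score
    return scores

def merge_phrases(tokens, scores, threshold):
    """Join adjacent tokens whose bigram score exceeds threshold with '_'."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical toy corpus, already tokenized with a plain whitespace split.
corpus = [
    "machine learning helps text analysis".split(),
    "we apply machine learning to reviews".split(),
    "learning from text takes time".split(),
]
scores = score_bigrams(corpus, min_count=1)
phrased = [merge_phrases(s, scores, threshold=1.0) for s in corpus]
print(phrased[1])  # → ['we', 'apply', 'machine_learning', 'to', 'reviews']
```

Only "machine learning" repeats as a pair in this corpus, so it is the only bigram merged; on real data, min_count and threshold are the settings to tune.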

Break (10 minutes)

Segment 3: Feature extraction from text data (60 minutes)

  • Word-representation spaces
  • Approaches to normalizing tokens
  • Extracting entities with dependency parsing in spaCy
  • Leveraging part-of-speech tagging to extract noun phrases using spaCy
  • Entity linking using PyTextRank and spaCy
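
As an illustration of token normalization, the sketch below uses only the Python standard library to fold case, strip accents, drop punctuation, and remove stop words, so that surface variants of a word collapse to one form. It is an assumption-laden stand-in, not spaCy's pipeline: the stop list is a hypothetical, deliberately tiny one, and the regular expression keeps only ASCII letters and digits, which suits English text.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop list; real projects use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "was"}

def normalize_tokens(text, stop_words=STOP_WORDS):
    """Return case-folded, accent-stripped word tokens without stop words."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    folded = stripped.casefold()
    # Keep runs of ASCII letters and digits; punctuation acts as a separator.
    tokens = re.findall(r"[a-z0-9]+", folded)
    return [t for t in tokens if t not in stop_words]

result = normalize_tokens("The Café was GREAT, and the café is cosy!")
print(result)  # → ['cafe', 'great', 'cafe', 'cosy']
```

Rule-based folding like this does not merge inflected forms such as "cafes" and "cafe"; that requires stemming or lemmatization.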

Break (10 minutes)

Segment 4: Exploratory data analysis of documents (30 minutes)

  • Applying clustering algorithms to text data using scikit-learn
  • Tuning clustering algorithms
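
The clustering idea can be sketched without any libraries: represent each document as a TF-IDF vector, then group the vectors with k-means. This is a minimal sketch under stated assumptions, not scikit-learn's implementation. The smoothed IDF and L2 normalization are assumed to match the defaults of scikit-learn's TfidfVectorizer, and the function names, the toy documents, and the fixed starting centroids are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    vocab = sorted({t for doc in docs for t in doc})
    df = Counter(t for doc in docs for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF keeps the weight positive even for ubiquitous terms.
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def kmeans(vectors, init_indices, n_iter=10):
    """Plain k-means; centroids start at the vectors named by init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy documents: two about payments, two about devices.
docs = [
    "refund payment card declined".split(),
    "payment refund delayed card".split(),
    "battery screen phone broken".split(),
    "phone battery drains screen".split(),
]
_, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, init_indices=[0, 2])
print(labels)  # → [0, 0, 1, 1]
```

The number of clusters and the starting centroids are the main settings to tune; here both are fixed by init_indices so the result is reproducible.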

Q&A (10 minutes)