Data Engineering Workflows for NLP, LLMs, and Vector Search

Published by O'Reilly Media, Inc.

Intermediate

Implement pipelines while managing performance, cost, and scalability

This live event utilizes Jupyter Notebook technology

What you’ll learn and how you can apply it

Apply classical NLP techniques at scale using Spark NLP for text analytics workloads
Identify when statistical NLP approaches are preferable to LLM-based solutions
Understand how NLP and LLM methods can be used in tandem
Build and deploy a vector data pipeline using LLM embeddings and pgvector in Postgres
Run containerized data pipelines in cloud environments with controlled costs
Evaluate architectural trade-offs among NLP pipelines, vector databases, and generative AI components

Course description

Statistical NLP can serve as a fast, cost-efficient alternative or complement to generative AI for many text analytics tasks. Data engineer Matt Housley demonstrates how. You’ll review classical NLP techniques using Spark NLP, then design and build a modern vector data pipeline, generating embeddings with an LLM and loading them into Postgres using pgvector.

The course emphasizes hands-on execution over theory, focusing on realistic engineering workflows and cost-conscious infrastructure usage. Exercises will be run in containerized environments and deployed via O’Reilly’s hosted cloud instances. You’ll gain practical experience implementing pipelines while learning how to evaluate architectural trade-offs across performance, cost, and scalability.

This live event is for you because...

You’re a data engineer or software engineer who wants to build AI systems.
You’re an ML engineer who wants hands-on experience with NLP and vector databases to support your training and inference pipelines.
You’re a technical practitioner interested in embedding-based search and cost-efficient alternatives to pure LLM architectures.
You’re a student who wants to build AI-oriented career skills.

Prerequisites

General coding and software engineering experience (The course is built around Python but should be accessible to those with other programming language backgrounds.)
Basic familiarity with containers (Docker)
General database experience (e.g., Postgres)
Working knowledge of cloud concepts (AWS or similar)
General understanding of data pipelines

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

The cost reality of AI and the return of classical NLP (45 minutes)

Presentation: Why “LLMs everywhere” breaks at scale; where classical NLP wins
Group discussion: Where are you already doing text analytics, and what hurts about it?
Demonstration: Spark NLP pipeline anatomy and how it fits into data engineering workflows
Hands-on exercise: Run your first Spark NLP text analytics pipeline (container-based)
Q&A

Building NLP pipelines that behave in the real world (55 minutes)

Presentation: Design patterns for NLP in data engineering (batch versus streaming, governance, testing); when to stop tuning and ship—practical evaluation and trade-offs
Hands-on exercise: Extend the NLP pipeline (add stages, evaluate outputs, handle common edge cases)
Group discussion: What broke, what surprised you, and what would you change?
Q&A
Break

Vector data pipelines with LLM embeddings (60 minutes)

Presentation: What vectors are for, and what they’re not for; vector database limitations and trade-offs (dimensions, chunk size, approximate nearest neighbor algorithms)
Demonstration: Building the pipeline (vectorizing text, storing in Postgres with pgvector, query basics)
Hands-on exercise: Run the vector pipeline end-to-end (containers and AWS deployment)
Q&A
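
To give a feel for the trade-offs named in this segment (chunk size, dimensions, nearest-neighbor search), here is a minimal pure-Python sketch. It is not the course material or the instructor's pipeline. The names chunk_text, toy_embed, cosine_distance, top_k, and pgvector_vector_bytes are hypothetical, and toy_embed is a hashed bag-of-words stand-in for a real LLM embedding call. The search is an exact brute-force scan, which is what an approximate nearest neighbor index trades accuracy against for speed. The storage figure follows the pgvector documentation's stated size of 4 bytes per dimension plus 8 bytes per vector and ignores row and index overhead.

```python
import hashlib
import math
from typing import List, Tuple

DIM = 64  # toy dimension (assumption); real embedding models use far more


def chunk_text(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into overlapping windows of `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = DIM) -> List[float]:
    """Hypothetical stand-in for an LLM embedding: hashed word counts, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        token = word.strip(".,;:!?")
        if not token:
            continue
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: str, rows: List[Tuple[str, List[float]]], k: int = 3) -> List[Tuple[float, str]]:
    """Exact (brute-force) nearest neighbors: score every row, keep the k closest."""
    query_vec = toy_embed(query)
    scored = [(cosine_distance(query_vec, vec), chunk) for chunk, vec in rows]
    return sorted(scored)[:k]


def pgvector_vector_bytes(dimensions: int) -> int:
    """Raw size of one pgvector `vector` value: 4 bytes per dimension plus 8."""
    return 4 * dimensions + 8
```

A typical use would be to build rows as [(c, toy_embed(c)) for c in chunk_text(document)] and then call top_k(question, rows). In Postgres, the equivalent ranking would be done in the database, where pgvector's <=> operator computes cosine distance. Smaller chunks give more rows and finer-grained matches, and higher dimensions raise storage per row, which is why both show up in the cost discussion.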

Operational reality and architecture trade-offs (60 minutes)

Presentation: Classical NLP versus LLM-based processing versus hybrid patterns; practical decision framework
Hands-on exercise: Cost-control and teardown drill (find resources, estimate burn, tear down safely)
Group discussion: Decision criteria for 2–3 participant scenarios

Wrap-up and Q&A (20 minutes)

Your Instructor

Matt Housley
Matt Housley, a data engineering consultant and cloud specialist, is cofounder of Ternary Data, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. After some early programming experience with Logo, BASIC, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. Matt, Joe Reis, and their guests pontificate on all things data on The Monday Morning Data Chat.


Skills covered

Data Engineering

spaCy
Scikit-learn
PyTorch

Cloud Computing