Data Engineering Workflows for NLP, LLMs, and Vector Search
Published by O'Reilly Media, Inc.
Implement pipelines while managing performance, cost, and scalability
What you’ll learn and how you can apply it
- Apply classical NLP techniques at scale using Spark NLP for text analytics workloads
- Identify when statistical NLP approaches are preferable to LLM-based solutions
- Understand how classical NLP and LLM methods can be used in tandem
- Build and deploy a vector data pipeline using LLM embeddings and pgvector in Postgres
- Run containerized data pipelines in cloud environments with controlled costs
- Evaluate architectural trade-offs among NLP pipelines, vector databases, and generative AI components
Course description
Statistical NLP can serve as a fast, cost-efficient alternative or complement to generative AI for many text analytics tasks. Data engineer Matt Housley demonstrates how. You’ll review classical NLP techniques using Spark NLP, then design and build a modern vector data pipeline, generating embeddings with an LLM and loading them into Postgres using pgvector.
The course emphasizes hands-on execution over theory, focusing on realistic engineering workflows and cost-conscious infrastructure usage. Exercises will be run in containerized environments and deployed via O’Reilly’s hosted cloud instances. You’ll gain practical experience implementing pipelines while learning how to evaluate architectural trade-offs across performance, cost, and scalability.
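As a rough illustration of the kind of query the vector pipeline supports, the sketch below ranks stored embeddings by cosine distance, the metric behind pgvector's `<=>` operator. It is not taken from the course materials: the three-dimensional vectors, the document IDs, and the helper names `cosine_distance` and `nearest` are hypothetical, and a real pipeline would obtain embeddings from a model and let Postgres run the search.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, k=2):
    """Brute-force top-k search over (doc_id, embedding) pairs."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 3-dimensional "embeddings", made up purely for illustration.
rows = [
    ("doc-a", [1.0, 0.0, 0.0]),
    ("doc-b", [0.9, 0.1, 0.0]),
    ("doc-c", [0.0, 1.0, 0.0]),
]

# Hypothetical query vector; it points mostly along the second axis.
top = nearest([0.0, 0.9, 0.1], rows, k=2)
print(top)
```

With pgvector, the same ranking is expressed in SQL as an `ORDER BY embedding <=> <query vector> LIMIT k` clause, and an approximate index can trade some recall for speed on larger tables.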
This live event is for you because...
- You’re a data engineer or software engineer who wants to build AI systems.
- You’re an ML engineer who wants hands-on experience with NLP and vector databases to support your training and inference pipelines.
- You’re a technical practitioner interested in embedding-based search and cost-efficient alternatives to pure LLM architectures.
- You’re a student who wants to build AI-oriented career skills.
Prerequisites
- General coding and software engineering experience (The course is built around Python but should be accessible to those with other programming language backgrounds.)
- Basic familiarity with containers (Docker)
- General database experience (e.g., Postgres)
- Working knowledge of cloud concepts (AWS or similar)
- General understanding of data pipelines
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
The cost reality of AI and the return of classical NLP (45 minutes)
- Presentation: Why “LLMs everywhere” breaks at scale; where classical NLP wins
- Group discussion: Where are you already doing text analytics, and what hurts about it?
- Demonstration: Spark NLP pipeline anatomy and how it fits into data engineering workflows
- Hands-on exercise: Run your first Spark NLP text analytics pipeline (container-based)
- Q&A
Building NLP pipelines that behave in the real world (55 minutes)
- Presentation: Design patterns for NLP in data engineering (batch versus streaming, governance, testing); practical evaluation and trade-offs, including when to stop tuning and ship
- Hands-on exercise: Extend the NLP pipeline (add stages, evaluate outputs, handle common edge cases)
- Group discussion: What broke, what surprised you, and what would you change?
- Q&A
- Break
Vector data pipelines with LLM embeddings (60 minutes)
- Presentation: What vectors are for, and what they’re not for; vector database limitations and trade-offs (dimensions, chunk size, choice of approximate nearest neighbor algorithm)
- Demonstration: Building the pipeline (vectorizing text, storing in Postgres with pgvector, query basics)
- Hands-on exercise: Run the vector pipeline end-to-end (containers and AWS deployment)
- Q&A
Operational reality and architecture trade-offs (60 minutes)
- Presentation: Classical NLP versus LLM-based processing versus hybrid patterns; practical decision framework
- Hands-on exercise: Cost-control and teardown drill (find resources, estimate burn, tear down safely)
- Group discussion: Decision criteria for 2–3 participant scenarios
Wrap-up and Q&A (20 minutes)
Your Instructor
Matt Housley
Matt Housley, a data engineering consultant and cloud specialist, is cofounder of Ternary Data, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. After some early programming experience with Logo, BASIC, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. Matt, Joe Reis, and their guests pontificate on all things data on The Monday Morning Data Chat.
Skills covered
- Data Engineering
- spaCy
- Scikit-learn
- PyTorch