Eval-Driven Development for Reliable Agents
Published by O'Reilly Media, Inc.
Build, test, and refine AI agents using Pydantic AI
What you’ll learn and how you can apply it
- Build type-safe AI agents using Pydantic AI with structured outputs and validation
- Design and implement evaluation frameworks to measure agent performance quantitatively
- Apply iterative improvement cycles using eval results to enhance agent reliability and accuracy
Course description
Building reliable AI agents requires more than just prompt engineering; it demands systematic evaluation and testing. Eval-driven development (EDD) is a methodology that treats AI agent quality as a measurable property that can be improved through automated evaluations.
With the guidance of AI engineer Ben O’Mahony, you’ll learn how to build, test, and refine AI agents using Pydantic AI. A close examination of two practical examples—a transcription punctuation agent and a data contract generator agent—will help you understand how to define success criteria, create evaluation datasets, and use eval results to systematically improve agent performance. In three hours, you’ll have the skills to confidently deploy AI agents that meet reliability standards and continuously improve over time.
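To give a rough sense of the workflow described above, here is a minimal sketch of an eval loop for a punctuation task. It is not taken from the course materials and does not use Pydantic AI: `EvalCase`, `exact_match`, `run_evals`, and both agents are hypothetical names, and the agents are plain Python stand-ins for an LLM-backed agent so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One dataset entry: raw transcript in, expected punctuated text out."""
    raw: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Success criterion (an assumption): 1.0 on an exact match, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the agent's mean score across all cases."""
    scores = [exact_match(agent(case.raw), case.expected) for case in cases]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for an LLM-backed agent.
def baseline_agent(text: str) -> str:
    # Only appends a full stop.
    return text + "."


def improved_agent(text: str) -> str:
    # Also capitalizes the first character.
    return text[:1].upper() + text[1:] + "."


# A tiny hand-written evaluation dataset.
DATASET = [
    EvalCase("hello world", "Hello world."),
    EvalCase("Evals make quality measurable", "Evals make quality measurable."),
]

baseline_score = run_evals(baseline_agent, DATASET)
improved_score = run_evals(improved_agent, DATASET)
print(f"baseline={baseline_score:.2f} improved={improved_score:.2f}")
```

The point of the sketch is that the score, not a spot check, decides whether a change to the agent is an improvement. A real evaluation would likely use a larger dataset and softer criteria than exact match.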
This live event is for you because...
- You’re a software developer who’s building or planning to build LLM-powered applications.
- You work with AI systems and want to move beyond ad hoc testing to systematic evaluation.
- You want to learn modern best practices for building reliable, production-ready AI agents.
Prerequisites
- A computer with uv installed
- An API key from any supported provider to use for the agents
- Intermediate Python experience (type hints, async/await, and modern Python features)
- Basic familiarity with LLMs and API usage (OpenAI, Anthropic, etc.)
- Experience with Git and command-line tools
- An understanding of testing concepts (unit tests, assertions, test-driven development)
Recommended preparation:
- Download the course repository (link to come)
Recommended follow-up:
- Take Building AI Agents with Model Context Protocol (MCP) (live online course with Lucas Soares)
- Take Building Reliable RAG Applications: From PoC to Production (live online course with Sarang Sanjay Kulkarni)
- Read Building Applications with AI Agents (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Introduction and foundations (55 minutes)
- Presentation: Why traditional testing fails for AI agents; the EDD philosophy; Pydantic AI and eval-driven development fundamentals (agents, structured outputs, and type safety)
- Demonstration: Building and running a simple agent
- Break
Building evaluated agents: Transcription punctuation agent (65 minutes)
- Presentation: Defining success criteria for text transformation tasks
- Demonstration: Building a transcription punctuation agent with evals and validation
- Hands-on exercise: Run evals locally
- Q&A
- Break
Advanced patterns: Data contract generator (60 minutes)
- Presentation: Complex evaluations for structured outputs
- Demonstration: Building a data contract generator with multi-criteria evals and LLM as a judge
- Group discussion: Applying these patterns to your own use cases
- Q&A
Your Instructor
Ben O’Mahony
Ben O’Mahony is Principal AI Engineer at Thoughtworks. He is a results-driven AI and engineering leader with a track record of building high-performing teams and shipping business-critical AI, ML, and data products and platforms at scale. He has deep expertise across the full engineering and data lifecycle, from research to production deployment. Ben is adept at defining technical strategy, driving execution, and partnering cross-functionally to deliver measurable impact. Recently, he has been intensely focused on building generative AI platforms, models, and agents.