Eval-Driven Development for Reliable Agents

Intermediate

Build, test, and refine AI agents using Pydantic AI

What you’ll learn and how you can apply it

Build type-safe AI agents using Pydantic AI with structured outputs and validation
Design and implement evaluation frameworks to measure agent performance quantitatively
Apply iterative improvement cycles using eval results to enhance agent reliability and accuracy

Course description

Building reliable AI agents requires more than prompt engineering; it demands systematic evaluation and testing. Eval-driven development (EDD) is a methodology that treats AI agent quality as a measurable property, one that can be improved through automated evaluations.
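To make the idea of quality as a measurable property concrete, here is a minimal sketch of an eval loop. It is not taken from the course materials: `EvalCase`, `run_evals`, and `stub_agent` are hypothetical names, and the stub stands in for a real LLM-backed agent so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One evaluation example: an input and the output we expect."""
    name: str
    input_text: str
    expected: str


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the agent output matches exactly."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(agent(case.input_text) == case.expected for case in cases)
    return passed / len(cases)


def stub_agent(text: str) -> str:
    """Hypothetical stand-in for an LLM call: capitalize and add a period."""
    return text.capitalize() + "."


CASES = [
    EvalCase("statement", "hello world", "Hello world."),
    EvalCase("question", "how are you", "How are you?"),
]

score = run_evals(stub_agent, CASES)
print(f"pass rate: {score:.0%}")  # prints "pass rate: 50%"
```

The stub fails the question case, so the pass rate is 50%. In an eval-driven workflow, that number is what you would try to raise by changing the prompt, model, or agent logic, then rerunning the same cases.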

With the guidance of AI engineer Ben O’Mahony, you’ll learn how to build, test, and refine AI agents using Pydantic AI. A close examination of two practical examples—a transcription punctuation agent and a data contract generator agent—will help you understand how to define success criteria, create evaluation datasets, and use eval results to systematically improve agent performance. In three hours, you’ll have the skills to confidently deploy AI agents that meet reliability standards and continuously improve over time.

This live event is for you because...

You’re a software developer who’s building or planning to build LLM-powered applications.
You work with AI systems and want to move beyond ad hoc testing to systematic evaluation.
You want to learn modern best practices for building reliable, production-ready AI agents.

Prerequisites

A computer with uv installed
An API key from any supported provider to use for the agents
Intermediate Python experience (type hints, async/await, and modern Python features)
Basic familiarity with LLMs and API usage (OpenAI, Anthropic, etc.)
Experience with Git and command-line tools
An understanding of testing concepts (unit tests, assertions, test-driven development)

Recommended preparation:

Download the course repository (link to come)

Recommended follow-up:

Take Building AI Agents with Model Context Protocol (MCP) (live online course with Lucas Soares)
Take Building Reliable RAG Applications: From PoC to Production (live online course with Sarang Sanjay Kulkarni)
Read Building Applications with AI Agents (book)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction and foundations (55 minutes)

Presentation: Why traditional testing fails for AI agents; the EDD philosophy; Pydantic AI and eval-driven development fundamentals (agents, structured outputs, and type safety)
Demonstration: Building and running a simple agent
Break

Building evaluated agents: Transcription punctuation agent (65 minutes)

Presentation: Defining success criteria for text transformation tasks
Demonstration: Building a transcription punctuation agent with evals and validation
Hands-on exercise: Run evals locally
Q&A
Break
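As an illustration of what a success criterion for a text transformation task might look like, the sketch below checks that a punctuated transcript keeps exactly the same words as the raw input. This is an assumed criterion chosen for the example, not necessarily the one used in the course; the function names are hypothetical.

```python
import re


def words_only(text: str) -> list[str]:
    """Lowercase the text and keep only word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def preserves_words(raw: str, punctuated: str) -> bool:
    """True if punctuation and casing changed but the word sequence did not."""
    return words_only(raw) == words_only(punctuated)


def ends_with_terminal_punctuation(punctuated: str) -> bool:
    """True if the output ends with a period, question mark, or exclamation mark."""
    return punctuated.rstrip().endswith((".", "?", "!"))


raw = "so what did you find well it's broken"
good = "So, what did you find? Well, it's broken."
bad = "So what did you discover? It is broken."

print(preserves_words(raw, good))            # True
print(preserves_words(raw, bad))             # False: the words were rewritten
print(ends_with_terminal_punctuation(good))  # True
```

Deterministic checks like these are cheap to run on every change and catch a common failure mode for this kind of agent: paraphrasing the transcript instead of only punctuating it.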

Advanced patterns: Data contract generator (60 minutes)

Presentation: Complex evaluations for structured outputs
Demonstration: Building a data contract generator with multi-criteria evals and LLM as a judge
Group discussion: Applying these patterns to your own use cases
Q&A
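A multi-criteria eval for a structured output can be sketched as a set of independent checks that each return a score. Everything here is an assumption made for illustration: the contract shape (`dataset`, `owner`, `fields`), the allowed type names, and `stub_judge`, which is a deterministic placeholder where a real setup would call an LLM as a judge.

```python
from typing import Callable

# Assumed set of column types for this example only.
ALLOWED_TYPES = {"string", "integer", "float", "boolean", "timestamp"}


def has_required_keys(contract: dict) -> bool:
    """Check that the assumed top-level keys are all present."""
    return {"dataset", "owner", "fields"} <= contract.keys()


def field_types_valid(contract: dict) -> bool:
    """Check that there is at least one field and every type is allowed."""
    fields = contract.get("fields", [])
    return bool(fields) and all(f.get("type") in ALLOWED_TYPES for f in fields)


def stub_judge(contract: dict) -> float:
    """Hypothetical placeholder for an LLM judge: share of described fields."""
    fields = contract.get("fields", [])
    if not fields:
        return 0.0
    return sum(1 for f in fields if f.get("description")) / len(fields)


def evaluate_contract(
    contract: dict, judge: Callable[[dict], float]
) -> dict[str, float]:
    """Score one contract on each criterion separately."""
    return {
        "required_keys": float(has_required_keys(contract)),
        "field_types": float(field_types_valid(contract)),
        "description_quality": judge(contract),
    }


contract = {
    "dataset": "orders",
    "owner": "sales-team",
    "fields": [
        {"name": "order_id", "type": "string", "description": "Unique order ID"},
        {"name": "created_at", "type": "timestamp"},
    ],
}

scores = evaluate_contract(contract, stub_judge)
print(scores)
```

Reporting each criterion separately, rather than a single pass or fail, shows which aspect of the output regressed when a prompt or model changes.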

Your Instructor

Ben O’Mahony
Ben O’Mahony is a Principal AI Engineer at Thoughtworks. He is an AI and engineering leader with a track record of building high-performing teams and shipping business-critical AI, ML, and data products and platforms at scale. He has deep expertise across the full engineering and data lifecycle, from research to production deployment. Ben is adept at defining technical strategy, driving execution, and partnering cross-functionally to deliver measurable impact. Recently, he has focused on building generative AI platforms, models, and agents.

Skill covered

Generative AI
