Eval-Driven Development for Reliable Agents

Published by O'Reilly Media, Inc.

Content level: Intermediate

Build, test, and refine AI agents using Pydantic AI

What you’ll learn and how you can apply it

  • Build type-safe AI agents using Pydantic AI with structured outputs and validation
  • Design and implement evaluation frameworks to measure agent performance quantitatively
  • Apply iterative improvement cycles using eval results to enhance agent reliability and accuracy

Course description

Building reliable AI agents requires more than prompt engineering; it demands systematic evaluation and testing. Eval-driven development (EDD) is a methodology that treats AI agent quality as a measurable property that can be improved through automated evaluations.

With the guidance of AI engineer Ben O’Mahony, you’ll learn how to build, test, and refine AI agents using Pydantic AI. A close examination of two practical examples—a transcription punctuation agent and a data contract generator agent—will help you understand how to define success criteria, create evaluation datasets, and use eval results to systematically improve agent performance. In three hours, you’ll have the skills to confidently deploy AI agents that meet reliability standards and continuously improve over time.
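To make the loop described above concrete, here is a minimal sketch of an eval run for a punctuation task, written in plain Python with no agent framework. It is not the course's code and does not use the Pydantic AI API. The names `Case`, `words`, `preserves_words`, `exact_match`, `stub_punctuate`, and `run_evals` are all hypothetical, and `stub_punctuate` is a trivial stand-in for a real agent call. The two success criteria (the output keeps the input's words, and the output matches a reference) are assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One evaluation example: raw transcript in, reference punctuation out."""
    name: str
    raw: str
    expected: str


def words(text: str) -> list[str]:
    """Lowercased word tokens, ignoring punctuation and capitalization."""
    return re.findall(r"[a-z0-9']+", text.lower())


def preserves_words(case: Case, output: str) -> bool:
    """Criterion 1 (assumed): punctuating must not add, drop, or change words."""
    return words(output) == words(case.raw)


def exact_match(case: Case, output: str) -> bool:
    """Criterion 2 (assumed): output equals the reference punctuation."""
    return output == case.expected


def stub_punctuate(raw: str) -> str:
    """Hypothetical stand-in for an agent: capitalize and append a period."""
    return raw[:1].upper() + raw[1:] + "."


def run_evals(task: Callable[[str], str], cases: list[Case]) -> dict[str, float]:
    """Run the task on every case and report the pass rate per criterion."""
    outputs = [(case, task(case.raw)) for case in cases]
    total = len(cases)
    return {
        "preserves_words": sum(preserves_words(c, o) for c, o in outputs) / total,
        "exact_match": sum(exact_match(c, o) for c, o in outputs) / total,
    }


# A tiny, made-up evaluation dataset.
CASES = [
    Case("simple", "hello world", "Hello world."),
    Case("question", "how are you", "How are you?"),
]

report = run_evals(stub_punctuate, CASES)
print(report)
```

With this stub, every case keeps its words but only the first matches its reference, so the exact-match pass rate is the number to improve. Swapping `stub_punctuate` for a real agent call and rerunning the same cases after each prompt or model change is the iterative cycle the course description refers to.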

This live event is for you because...

  • You’re a software developer who’s building or planning to build LLM-powered applications.
  • You work with AI systems and want to move beyond ad hoc testing to systematic evaluation.
  • You want to learn modern best practices for building reliable, production-ready AI agents.

Prerequisites

  • A computer with uv installed
  • An API key from any supported provider to use for the agents
  • Intermediate Python experience (type hints, async/await, and modern Python features)
  • Basic familiarity with LLMs and API usage (OpenAI, Anthropic, etc.)
  • Experience with Git and command-line tools
  • An understanding of testing concepts (unit tests, assertions, test-driven development)

Recommended preparation:

  • Download the course repository (link to come)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction and foundations (55 minutes)

  • Presentation: Why traditional testing fails for AI agents; the EDD philosophy; Pydantic AI and eval-driven development fundamentals (agents, structured outputs, and type safety)
  • Demonstration: Building and running a simple agent
  • Break

Building evaluated agents: Transcription punctuation agent (65 minutes)

  • Presentation: Defining success criteria for text transformation tasks
  • Demonstration: Building a transcription punctuation agent with evals and validation
  • Hands-on exercise: Run evals locally
  • Q&A
  • Break

Advanced patterns: Data contract generator (60 minutes)

  • Presentation: Complex evaluations for structured outputs
  • Demonstration: Building a data contract generator with multi-criteria evals and LLM as a judge
  • Group discussion: Applying these patterns to your own use cases
  • Q&A

Your Instructor

  • Ben O'Mahony

    Ben O’Mahony is Principal AI Engineer at Thoughtworks. He is a results-driven AI/Engineering leader with a track record of building high-performing teams and shipping business-critical AI, ML and data products and platforms at scale. He has deep expertise across the full Engineering and Data lifecycle from research to production deployment. Ben is adept at defining technical strategy, driving execution and partnering cross-functionally to deliver measurable impact. Recently Ben has been intensely focused on building Generative AI platforms, models and agents.

Skill covered

Generative AI