Chapter 5. Implementing Automated Evaluators
In Chapter 3, we identified failure modes by reading traces. In Chapter 4, we aligned the team on what counts as a failure. Now we automate that measurement so we can track failure rates at scale without manually reviewing every trace.
In this chapter, you will learn:
- How to translate failure modes into precise, automatable metrics
- When to use code-based checks versus LLM-as-Judge evaluators
- How to build and validate an LLM-as-Judge using training, dev, and test sets
- How to estimate the true success rate of your agent in production
- How to quantify uncertainty using bootstrapping and confidence intervals
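As a preview of the last item in the list, here is a minimal sketch of a percentile bootstrap for the success rate of an evaluated batch of traces. The function name, the 80-of-100 outcome data, and the normalization choices are illustrative, not from the chapter:

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a success rate.

    outcomes: list of 0/1 values, where 1 means the evaluator
    judged that trace a success.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement, recompute the success rate each time
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical batch: 80 passes out of 100 evaluated traces
outcomes = [1] * 80 + [0] * 20
low, high = bootstrap_ci(outcomes)
```

The point estimate here is 0.80, but the interval (roughly plus or minus a few percentage points at this sample size) conveys how much that number could move on a different sample of traces.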
Automated evaluators can compute various types of metrics.1 Some metrics are reference-free (as defined in Chapter 2), assessing inherent qualities of an output or its adherence to certain rules without needing a “golden” or ground-truth answer. Others are reference-based, comparing the agent’s output to a known correct or ideal response. For many failure modes, we can conceptualize and implement ...
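The reference-free versus reference-based distinction can be made concrete with two toy code-based checks. The function names and the normalization choices below are illustrative assumptions, not APIs from the chapter:

```python
import json

def valid_json_check(output: str) -> bool:
    """Reference-free: checks an inherent property of the output
    (does it parse as JSON?) without needing a golden answer."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def exact_match_check(output: str, reference: str) -> bool:
    """Reference-based: compares the output to a known-correct
    answer, here after trivial whitespace/case normalization."""
    return output.strip().lower() == reference.strip().lower()

# Reference-free: no ground truth required
valid_json_check('{"city": "Paris"}')   # True
valid_json_check('not json at all')     # False

# Reference-based: requires a golden answer
exact_match_check("  Paris\n", "paris")  # True
```

Both checks return a boolean per trace, so either can be averaged over a dataset to produce a failure rate.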