The process of developing robust evaluations for LLM applications is inherently iterative. It involves creating test cases, assessing performance, and refining the system based on those observations. High-level guides, such as Anthropic’s documentation on creating empirical evaluations for Claude (Anthropic, 2024), often depict the evaluation process as a cycle of developing test cases, engineering prompts, testing, and refining (Figure 3-1). This section, and indeed our overall “Analyze-Measure-Improve” lifecycle (Figure 1-2), provides a detailed, step-by-step methodology for the Analyze portion of this iterative loop, focusing specifically on ...
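To make the cycle concrete, the following is a minimal sketch of one pass through such an evaluation loop. The test-case structure, the `grade` function, and the `run_model` callable are hypothetical placeholders for illustration, not part of any particular library or of the methodology described later in this section.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str    # input sent to the model
    expected: str  # reference answer used for grading

def grade(output: str, expected: str) -> float:
    """Hypothetical grader: 1.0 if the expected answer appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0

def run_eval(test_cases, run_model) -> float:
    """One pass of the develop-test-refine cycle: run every case, return the mean score."""
    scores = [grade(run_model(tc.prompt), tc.expected) for tc in test_cases]
    return sum(scores) / len(scores)

# Example usage with a stubbed model; in practice run_model would call an LLM API,
# and the resulting score would guide the next round of prompt or test-case refinement.
cases = [TestCase("What is 2 + 2?", "4")]
print(run_eval(cases, run_model=lambda prompt: "The answer is 4."))
```

Each iteration of the loop feeds its results back into the next: low-scoring cases point to prompts to re-engineer or test cases to add, which is the refinement step the cycle in Figure 3-1 depicts.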