
Evals for AI Engineers

by Shreya Shankar, Hamel Husain

October 2026

Intermediate to advanced

225 pages

4h 39m

English

O'Reilly Media, Inc.


Brief Table of Contents (Not Yet Final)
Preface
  What This Book Covers
  Who Should Read This Book
  What This Book Doesn’t Cover
  How This Book Is Organized
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
  A Note on the Pace of Change
1. Introduction
  What is Evaluation?
  Evaluation Throughout the Lifecycle of an LLM
  Pre-training
  Post-training
  Application
  Evaluation in Applications
  The Three Gulfs of LLM Application Development
  The Gulf of Comprehension (Between the Developer and Data)
  The Gulf of Specification (Between the Developer and the LLM Pipeline)
  The Gulf of Generalization (Between the Data and the LLM Pipeline)
  Why LLM Pipeline Evaluation is Challenging
  The LLM Evaluation Lifecycle: Bridging the Gulfs with Evaluation
  Analyze
  Measure
  Improve
  Putting it Together
2. LLMs and Evaluation Basics
  Components of LLM Applications
  Single-step LLM calls
  Conversations
  Retrieval
  Tools
  Agents
  Fundamentals of Prompting
  Prompts for Single LLM Calls
  Retrieval Fundamentals
  Tools and Agents
  Evaluation Setups
  Absolute Evaluation
  Comparative Evaluation
  Putting It Together
3. Error Analysis
  Establishing Terminology
  Creating a Starting Dataset of Traces
  Defining Dimensions
  Generating Tuples and Queries
  Open Coding: Reading and Labeling Traces
  Axial Coding: Structuring Failure Modes
  Labeling Traces with Structured Failure Modes
  Iteration and Refinement
  Common Pitfalls
  Putting It Together
4. Collaborative Evaluation Practices
  When a Single Expert Is Enough
  A Collaborative Annotation Workflow
  Measuring Inter-Annotator Agreement
  Percent Agreement: Simple but Misleading
  Cohen’s Kappa: Adjusting for Chance
  Interpreting Kappa Scores
  Python Implementation
  When to Use Other Metrics
  Facilitating Alignment Sessions
  Preparing for the Session
  Running the Session
  Techniques for Improving the Rubric
  What to Avoid
  Escalation
  Save Everything
  Common Pitfalls
  Skipping Independent Annotation
  Starting with a Vague Rubric
  Using Only Percent Agreement
  Focusing on Past Labels Instead of Future Consistency
  Failing to Close the Loop
  Putting It Together
5. Implementing Automated Evaluators
  Defining the Right Metrics (What to Measure)
  Implementing Metrics (How to Measure)
  Writing LLM-as-Judge Prompts
  Data Splits for Designing and Validating LLM-as-Judge
  Iterative Prompt Refinement for the LLM-as-Judge
  When to Stop Refining
  If Alignment Stalls
  Estimating True Success Rates with Imperfect Judges
  Step 1: Measure Judge Accuracy
  Step 2: Observe Raw Success Rate
  Step 3: Correct the Observed Success Rate
  Step 4: Quantify Uncertainty with a Bootstrap
  Python Code for Estimating Success Rates
  Optional: Group-wise Metrics for Evaluating Multiple Outputs
  Common Pitfalls
  Putting It Together
6. Evaluating Multi-Turn Conversations
  Evaluating at Different Levels
  Session Level
  Turn Level
  Conversational Coherence and Memory
  Practical Strategies for Multi-Turn Evaluation
  Collecting Traces
  Isolating Failures
  Perturbation Testing
  Automated Evaluation of Multi-Turn Traces
  Addressing Common Pitfalls
  Putting It Together
7. Evaluating Retrieval-Augmented Generation
  Overview
  Synthetically Generating Query-Answer Pairs
  Creating Harder Synthetic Queries
  Filtering Synthetic Questions
  Metrics for Retrieval Quality
  Precision@k and Recall@k
  Mean Reciprocal Rank (MRR)
  Normalized Discounted Cumulative Gain (NDCG@k)
  Evaluating and Optimizing Chunking Strategies
  Tuning Chunk Size and Overlap
  Other Chunking Approaches
  Evaluating Generation Quality
  Common Pitfalls
  Putting It Together
8. Evaluating Tool Use and Complex Agents
  What Are Tools and Why Do LLMs Use Them?
  The Spectrum of Agency
  Evaluating Tool Calls
  Stage 1: Tool Selection
  Stage 2: Argument Generation
  Stage 3: Execution
  Stage 4: Output Handling
  Why Evaluate Each Stage Separately
  Debugging Multi-Step Traces
  Traceability
  From Traces to Systemic Patterns
  The Transition Failure Heatmap
  Investigating Hotspots
  Evolving the Heatmap Over Time
  Handling Complex Input Data
  Images
  Long Documents
  PDF Documents
  Putting It Together

9. Continuous Integration and Deployment for LLM Agents
  CI: Building a Safety Net Against Regressions
  Common Mistakes with CI
  Revisiting Prompts When Models Improve
  CD and Online Monitoring
  Observability
  Running Automated Evaluators in Production
  Guardrails
  Judge Drift
  The Continuous Improvement Flywheel
  Common Pitfalls
  Putting It Together
10. Interfaces for Human Review
  The Case for Custom Review Interfaces
  Essential Interface Elements
  Features That Improve Review Speed and Quality
  Case Study: EvalGen
  Selecting Traces for Human Review
  Random Sampling
  Uncertainty Sampling
  Failure-Driven Sampling
  Navigating Trace Groups and Discovering Patterns
  Clustering
  Search and Similarity Tools
  Integrating Human Review into Engineering Workflows
  Example Walkthrough: Reviewing Real Estate Assistant Emails
  Starting with a Spreadsheet
  A Simple Custom UI
  An Advanced Review Interface
  Case Study: DocWrangler for Prompt Refinement
  Putting It Together
11. Data Analysis for Traces
  EDA and Clustering to Surface Hypotheses
  Clustering
  Beyond Clustering
  Semantic Search to Find Related Traces
  Using AI Coding Assistants
  Putting It Together
12. Improving LLM Agents
  Accuracy Optimization
  Low Effort: Prompt Refinement
  Medium Effort: Changing How the Agent Works
  High Effort: Model-Level Changes
  Cost Optimization
  Match Model Size to Task Complexity
  Cut Token Usage
  Make Prompts Cache-Friendly
  Batching
  Distillation
  Model Cascades
  Closing: The Lifecycle Continues
  Next Steps
About the Authors

Content preview from Evals for AI Engineers

Chapter 6. Evaluating Multi-Turn Conversations

Until now, we have focused on evaluating interactions where the user sends one message and the agent sends one response back. We call that a single-turn interaction, meaning one exchange between the user and the agent. Many applications involve multi-turn conversations, where the user and agent go back and forth multiple times. Each exchange (one user message and the agent’s response) is a turn. A full conversation from start to finish is a session.
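The terminology above can be sketched as a small data structure. This is a minimal illustration, not the book's own code: the class names `Turn` and `Session`, the field names, and the helper `is_single_turn` are all hypothetical, chosen only to mirror the definitions in the paragraph.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One exchange: a user message and the agent's response to it."""
    user_message: str
    agent_response: str


@dataclass
class Session:
    """A full conversation from start to finish, as an ordered list of turns."""
    turns: list[Turn] = field(default_factory=list)

    def is_single_turn(self) -> bool:
        # A single-turn interaction is a session with exactly one exchange.
        return len(self.turns) == 1


# Hypothetical example data: a two-turn session.
session = Session(turns=[
    Turn("What time do you open?", "We open at 9 a.m."),
    Turn("And on Sundays?", "On Sundays we open at 11 a.m."),
])
print(len(session.turns), session.is_single_turn())
```

Representing a session as an ordered list of turns keeps both units available to an evaluator: it can score the session as a whole or look at each turn in the context of the turns before it.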

In multi-turn conversations, the agent has to maintain context across turns, follow instructions over time, and respond coherently as the conversation develops. This creates new evaluation challenges that single-turn methods do not cover.

In this chapter, you will learn:

How to evaluate at the session, turn, and coherence levels
When to isolate failures as single-turn problems versus genuinely multi-turn issues
How to use perturbation testing to probe robustness
How to build session-level evaluators before investing in turn-level analysis

The core evaluation ...


Publisher Resources

ISBN: 9798341660717
