Chapter 4. Collaborative Evaluation Practices
In Chapter 3, we walked through error analysis: reading traces, identifying failure modes, and building a taxonomy of how your system goes wrong. That process depends on human judgment at every step. You decide what counts as a failure. You decide how to categorize errors. You decide whether a trace is acceptable or not.
But what happens when your judgment is not enough?
Sometimes the evaluation criteria are inherently subjective. What one person considers a “helpful” response, another might find verbose or off-topic. A tone that feels professional to an engineer might strike a customer service lead as cold. When you are evaluating qualities like clarity, empathy, or appropriateness, there is no objective ground truth to fall back on.
Other times, an expert is required to make quality judgments. In complex domains—legal document review, medical advice, financial analysis—no one person has the expertise to catch every type of error. A software engineer might miss technical inaccuracies ...