Preface
Large language models have moved from research curiosity to production reality with remarkable speed. Organizations now routinely embed LLMs in customer service systems, content generation pipelines, decision support tools, and information extraction workflows. Yet this rapid adoption has outpaced our ability to systematically evaluate whether these systems actually work.
This book addresses that gap.
Unlike traditional software with deterministic outputs, LLM pipelines produce responses that are often subjective, context-dependent, and multifaceted. A response might be factually accurate yet inappropriate for the context. It might sound persuasive while conveying incorrect information. It might also address most but not all parts of the user’s question. These ambiguities make evaluation fundamentally different from conventional software testing or even traditional machine learning validation.
The challenge we tackle is straightforward to state but difficult to solve: How do you assess whether an LLM pipeline is performing adequately? How do you diagnose where it's failing? And how do you systematically improve it?