Chapter 1. Introduction
The past two years have seen rapid advances in the development and deployment of large language models (LLMs). Organizations are now embedding LLMs in critical pipelines across applications: customer service, content creation, decision support, and information extraction, among others (Bommasani 2021; Zaharia 2024). However, adoption is outpacing our ability to systematically evaluate LLM applications (Ward 2024).
Unlike traditional software, LLM applications do not produce deterministic responses, nor do they have formal specifications for the behavior they should and should not exhibit. Their outputs are unpredictable. Sometimes an output is obviously incorrect; other times, it is hard to articulate what is wrong. An output can be factually correct yet seem inappropriate (the “vibes are off”), or sound completely convincing while being entirely wrong. These ambiguities make evaluation fundamentally different from conventional software testing, and even from traditional machine learning (ML).1
The core challenge of evaluation is as follows: ...