Chapter 5. Formalizing Evaluation Metrics
In Chapter 4 you learned strategies for collecting datasets. You saw how error analysis helps you identify common failure patterns in your program, and how to work with domain experts and language models to build a definitive list of typical inputs and expected outputs for your program. That chapter stressed the importance of collecting useful examples of your task, from which we can learn the ‘rules’ of how your program should work. But a dataset alone isn’t enough: you need a formal way to measure whether your program is getting better or worse. That’s where evaluation metrics come in.
Evaluation is an expert topic, and the primary preoccupation of AI engineers who have a product in production. Ultimately, if you don’t have good evaluation metrics, you don’t know whether your application is failing or succeeding, and you can’t anticipate whether a change you want to make will help or harm your users. Context engineering starts and ends with evaluation: it’s how you notice there’s ...