Chapter 1. Introduction
The past two years have seen rapid advances in the development and deployment of large language models (LLMs). Organizations are now embedding LLMs in critical pipelines across applications: customer service, content creation, decision support, and information extraction, among others (Bommasani 2021; Zaharia 2024). However, adoption is outpacing our ability to systematically evaluate LLM applications (Ward 2024).
Unlike traditional software, LLM applications do not produce deterministic responses, nor do they have formal specifications for the behavior they should and should not exhibit. Their outputs are unpredictable. Sometimes an output is obviously incorrect; other times, it is hard to articulate what is wrong. An output can be factually correct yet seem inappropriate (the “vibes are off”), or sound completely convincing while being entirely wrong. These ambiguities make evaluation fundamentally different from conventional software testing, and even from traditional machine learning (ML).1
The core challenge of evaluation is as follows: ...