Chapter 4. Collaborative Evaluation Practices
In Chapter 3, we walked through error analysis: reading traces, identifying failure modes, and building a taxonomy of how your system goes wrong. That process depends on human judgment at every step. You decide what counts as a failure. You decide how to categorize errors. You decide whether a trace is acceptable or not.
But what happens when your judgment is not enough?
Sometimes the evaluation criteria are inherently subjective. What one person considers a “helpful” response, another might find verbose or off-topic. A tone that feels professional to an engineer might strike a customer service lead as cold. When you are evaluating qualities like clarity, empathy, or appropriateness, there is no objective ground truth to fall back on.
Other times, an expert is required to make quality judgments. In complex domains—legal document review, medical advice, financial analysis—no one person has the expertise to catch every type of error. A software engineer might miss technical inaccuracies ...