Chapter 12. Evaluating LLMs

Introduction

Admittedly, we’ve spent the vast majority of this book building, thinking about, and iterating on our LLM systems, and not as much time establishing rigorous, structured tests against those systems. That said, we have seen evaluation at play throughout this entire book in bits and pieces. We evaluated our fine-tuned recommendation engine by judging the recommendations it produced, we tested our classifiers against metrics like accuracy and precision, and we validated our chat-aligned SAWYER and T5 models against our reward mechanisms and even on some benchmarks.
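As a quick refresher on those classifier metrics, a minimal sketch of that style of evaluation might look like the following, using scikit-learn; the labels and predictions here are hypothetical stand-ins for a real held-out test set:

from sklearn.metrics import accuracy_score, precision_score

# Hypothetical ground-truth labels and model predictions,
# standing in for a real held-out test set.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Accuracy: fraction of predictions that match the labels.
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")

# Precision: fraction of positive predictions that are correct.
print(f"Precision: {precision_score(y_true, y_pred):.2f}")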

This chapter aggregates all of these evaluation techniques while adding to the list. That’s because, at the end of the day, no matter how well ...
