October 2024
Intermediate to advanced
384 pages
13h 7m
English
Admittedly, we’ve spent the vast majority of this book building, thinking about, and iterating on our LLM systems, and not as much time establishing rigorous, structured tests for those systems. That said, we have seen evaluation at play in bits and pieces throughout the book. We evaluated our fine-tuned recommendation engine by judging the recommendations it gave, we tested our classifiers against metrics like accuracy and precision, and we validated our chat-aligned SAWYER and T5 models against our reward mechanisms and even on some benchmarks.
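As a quick refresher on the classifier metrics mentioned above, here is a minimal sketch of accuracy and precision in plain Python. The function names and the toy labels are illustrative assumptions, not code from earlier chapters; in practice a library such as scikit-learn would typically be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of the items predicted as `positive`, the fraction that truly are."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    predicted_positive = sum(p == positive for p in y_pred)
    true_positive = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    # Convention assumed here: return 0.0 when nothing was predicted positive.
    return true_positive / predicted_positive if predicted_positive else 0.0


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(f"accuracy:  {accuracy(y_true, y_pred):.3f}")
print(f"precision: {precision(y_true, y_pred):.3f}")
```

The two numbers answer different questions: accuracy counts every correct call, while precision looks only at the cases the model flagged as positive, which is why a classifier can score well on one and poorly on the other.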
This chapter brings together all of these evaluation techniques and adds to the list. That’s because, at the end of the day, no matter how well ...