Chapter 10. Evaluating LLM Applications
GitHub Copilot is arguably the first industrial-scale LLM application. The curse of going first is that some of the choices you make will seem silly in hindsight, laughably flying in the face of what (by now) everyone knows.
But one of the things we got absolutely right was how we got started. The oldest part of Copilot’s codebase is not the proxy, or the prompts, or the UI, or even the boilerplate setting up the application as an IDE extension. The very first bit of code we wrote was the evaluation, and it’s only thanks to that head start that we were able to move so quickly and confidently with everything else. That’s because, for every change we made, we could check directly whether it was a step in the right direction, a mistake, or a good attempt that simply didn’t have much impact. And that’s the main advantage of an evaluation framework for your LLM application: it guides all future development.
Depending on your application and where your project is in its lifecycle, different types of evaluation may be available and appropriate. The two big categories are offline and online evaluation. Offline evaluation scores your application against a fixed set of example cases, independent of any live runs with real users. Since it doesn’t require real users, or even, in many cases, an end-to-end working app, it will typically be the first evaluation you implement.
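To make that concrete, here is a minimal sketch of what an offline evaluation harness can look like in Python. The example cases, the substring-based pass/fail check, and the stub generate function are all illustrative assumptions, not how Copilot (or any particular product) scores completions; the point is simply that a fixed set of cases plus a scoring rule gives you a single number you can re-run after every change.

```python
# A minimal offline-evaluation sketch (not any product's actual harness).
# The cases, the substring-based check, and the stub model are assumptions;
# swap in your application's entry point and a scoring rule for your task.

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str               # input handed to the application
    expected_substring: str   # something a good answer should contain


CASES = [
    EvalCase(prompt="def add(a, b):\n    ", expected_substring="return a + b"),
    EvalCase(prompt="# Reverse the string s\n", expected_substring="s[::-1]"),
]


def run_offline_eval(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the application and return the pass rate."""
    passed = sum(case.expected_substring in generate(case.prompt) for case in cases)
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-in for the real application; replace with your LLM call.
    stub = lambda prompt: "return a + b"
    print(f"pass rate: {run_offline_eval(stub, CASES):.0%}")
```

The scoring rule here is deliberately crude; in practice you would replace the substring check with whatever signal fits your task (exact match, unit tests on generated code, a model-based judge), but the shape of the loop stays the same.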
Offline evaluation, however, is somewhat theoretical and possibly ...