Chapter 19: Evaluating Generative Large Language Models


What are the standard metrics for evaluating the quality of text generated by large language models, and why are these metrics useful?

Perplexity, BLEU, ROUGE, and BERTScore are some of the most common evaluation metrics used in natural language processing to assess the performance of LLMs across various tasks. Although there is ultimately no substitute for human quality judgments, human evaluations are tedious, expensive, hard to automate, and subjective. Hence, we develop automatic metrics that provide objective summary scores, letting us measure progress and compare different approaches at scale.
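To make two of these metrics concrete, here is a minimal sketch in plain Python (not the official NLTK or Hugging Face implementations): perplexity computed as the exponential of the average negative log-likelihood per token, and a simplified BLEU-1 score using clipped unigram precision with a brevity penalty. The function names and example sentences are illustrative, not taken from the chapter.

```python
import math
from collections import Counter

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.
    Lower is better; a uniform guess over k tokens gives perplexity k."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

def bleu1(candidate, reference):
    """Simplified BLEU with unigrams only: clipped precision * brevity penalty.
    (Full BLEU averages precisions over n-grams up to n=4.)"""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# A model assigning probability 0.25 to each of 4 tokens is as
# uncertain as a uniform choice over 4 options:
print(perplexity([math.log(0.25)] * 4))  # 4.0

print(bleu1("the cat sat", "the cat sat on the mat"))
```

ROUGE is the recall-oriented counterpart of the same n-gram overlap idea (how much of the reference appears in the candidate), while BERTScore replaces exact n-gram matching with similarity between contextual embeddings.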

This chapter discusses the difference ...
