Model Distillation and Teacher-Student Models

Model distillation has emerged as a key technique for reducing LLM inference cost and latency. Larger models are more capable, but they are also more expensive and slower to run. Model distillation transfers some of the capabilities of a large "teacher" model to a smaller "student" model that is more practical to deploy at scale. Google DeepMind has disclosed the use of model distillation in its Gemini and Gemma model series, and the technique is likely also used in Anthropic's and OpenAI's models.
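
To make the teacher-student idea concrete, the sketch below shows the classic soft-label distillation objective: the student is trained to match the teacher's temperature-scaled output distribution, blended with the usual hard-label cross-entropy. This is a minimal illustration, not the procedure used by any particular lab. It assumes PyTorch and Hugging Face-style models whose forward pass returns .logits; the names distillation_loss, train_step, and the batch keys are hypothetical, and details such as label shifting and padding masks are omitted.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the teacher's soft targets with the usual hard-label loss."""
    # Soft-target term: KL divergence between the temperature-scaled
    # teacher and student distributions. Scaling by T^2 keeps its
    # gradient magnitude comparable to the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard-target term: standard cross-entropy against ground-truth tokens.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1.0 - alpha) * hard


def train_step(student, teacher, batch, optimizer):
    """One hypothetical training step: the teacher is frozen,
    only the smaller student receives gradient updates."""
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits

    loss = distillation_loss(
        student_logits.flatten(0, 1),   # (batch*seq, vocab)
        teacher_logits.flatten(0, 1),
        batch["labels"].flatten(),      # (batch*seq,)
        temperature=2.0,
        alpha=0.5,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the temperature and the blending weight alpha are tuned per task; a higher temperature exposes more of the teacher's "dark knowledge" about the relative likelihood of tokens it did not predict, which is much of what the student gains over training on hard labels alone.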

Although distillation is currently harder to implement and less accessible outside closed AI labs, we expect it to soon join fine-tuning and RAG in the LLM builder's toolkit. Meta has allowed the use of its ...
