Chapter 10. Advancements in LLM Serving
If you’ve made it this far, congratulations: you have traveled from understanding model serving paradigms all the way to serving LLMs efficiently for different use cases.
This final chapter highlights a few emerging advancements in LLM serving and serves as a guide for your continued learning as new ideas and techniques appear. Some of these ideas could fill entire books, and the field is moving quickly. It is an exciting time to witness and contribute to the evolution of intelligent inference systems. Our goal here is to introduce the main ideas and frameworks so that you leave this book equipped to connect the core foundations we have covered with the new ideas shaping the next generation of LLM serving systems.
In this chapter, we will explore:
Semantic caching and routing as high-level mechanisms for smarter, semantics-aware request distribution
Performance profiling for fine-grained performance tuning
Multimodal serving, as text-based LLMs expand into vision language models (VLMs) and other modalities
Edge serving, bringing low-latency, privacy-preserving inference to devices
Multi-LoRA, enabling scalable, efficient deployment of personalized, fine-tuned models
LLM serving systems as the backbone of reinforcement learning inference
Semantic Caching
In Chapter 7, we discussed data parallelism where, behind a model serving endpoint, there are multiple model replicas that serve external traffic, ...