Chapter 7. Advanced LLM Optimization Techniques
After the last chapter, you are equipped with essential techniques for tackling many of the challenges of LLM serving optimization, especially for models that are not overly large and fit on a single GPU. For larger LLMs, for example those with more than 100 billion parameters, one GPU is usually not enough to hold the model in GPU memory and generate tokens at a satisfactory latency. In this chapter, we explore advanced techniques to further enhance LLM serving performance, including:
- Speculative decoding to speed up the decode phase of LLM generation for lower inter-token latency (ITL)
- Multi-GPU and multi-node serving for large LLMs that do not fit or are not performant enough when running on a single GPU
- Prefill-decode (PD) disaggregation to decouple the prefill and decode phases and fine-tune their trade-offs independently
- Advanced KV caching techniques to achieve lightning-fast time to first token (TTFT) and a high cache hit rate
Speculative Decoding
What if a single technique could improve latency, especially ITL, by a factor of two or three? Meet speculative decoding, a novel approach that is particularly useful for long, reasoning-heavy generations.
In a large ML system with, say, a million data points for retrieval or recommendation, it’s common to use a small but less accurate model to do the first round of filtering. Once you’ve significantly reduced the number of candidate data points to around 1,000, you then apply ...
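The coarse-to-fine filtering pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's implementation: `cheap_score`, `expensive_score`, and `two_stage_retrieve` are hypothetical names, the "models" are toy distance functions over random numbers, and the second stage (a slower, more accurate scorer applied only to the shortlist) is an assumption about how such a pipeline is typically completed.

```python
import heapq
import random


def cheap_score(query: float, item: float) -> float:
    """Hypothetical fast, coarse scorer: compares values rounded to 3 decimals."""
    return -abs(round(query, 3) - round(item, 3))


def expensive_score(query: float, item: float) -> float:
    """Hypothetical slow, accurate scorer: exact distance (stand-in for a large model)."""
    return -abs(query - item)


def two_stage_retrieve(query, items, shortlist_size=1000, top_k=5):
    # Stage 1: run the cheap scorer over every candidate and keep a shortlist.
    shortlist = heapq.nlargest(
        shortlist_size, items, key=lambda x: cheap_score(query, x)
    )
    # Stage 2 (assumed): run the expensive scorer only on the shortlist.
    return heapq.nlargest(top_k, shortlist, key=lambda x: expensive_score(query, x))


random.seed(0)
items = [random.random() for _ in range(100_000)]  # toy "data points"
results = two_stage_retrieve(0.5, items)
print(results)
```

The point of the design is cost: the expensive scorer runs on 1,000 candidates instead of all 100,000, at the risk that the cheap scorer drops a good candidate before the second stage sees it.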