Hands-On LLM Serving and Optimization

by Chi Wang, Peiheng Hu
May 2026
Intermediate to advanced
374 pages
11h 17m
English
O'Reilly Media, Inc.

Chapter 7. Advanced LLM Optimization Techniques

After the last chapter, you are equipped with essential techniques to tackle many of the challenges of LLM serving optimization, especially for models that are not overly large and fit on a single GPU. For larger LLMs, for example those with more than 100 billion parameters, one GPU is usually not enough to hold the model in GPU memory and generate at a satisfactory latency. In this chapter, we explore advanced techniques to further enhance LLM serving performance, including:

  • Speculative decoding to speed up the decode phase of LLM generation for lower inter-token latency (ITL)

  • Multi-GPU and multi-node serving for large LLMs that do not fit or are not performant enough when running on a single GPU

  • Prefill-decode (PD) disaggregation to decouple the prefill and decode phases and tune their trade-offs independently

  • Advanced KV caching techniques to achieve lightning-fast time to first token (TTFT) and a high cache hit rate
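To put the single-GPU limit mentioned in the chapter introduction into numbers, here is a rough back-of-the-envelope sketch. It assumes 2 bytes per parameter (FP16 or BF16 weights) and a hypothetical accelerator with 80 GB of memory; both values, and the function names, are illustrative assumptions rather than figures from the book. It counts weights only and ignores the KV cache, activations, and framework overhead, so real requirements are higher.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(
    num_params: float,
    gpu_memory_gb: float = 80.0,  # assumed per-GPU memory, not from the book
    bytes_per_param: int = 2,     # assumed FP16/BF16 weights
) -> int:
    """Lower bound on the GPU count needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    params = 100e9  # the 100-billion-parameter example from the text
    print(f"weights: {weight_memory_gb(params):.0f} GB")
    print(f"GPUs needed for weights alone: {min_gpus_for_weights(params)}")
```

Under these assumptions, the weights of a 100-billion-parameter model already exceed the memory of one such GPU several times over, before any memory is reserved for serving requests.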

Speculative Decoding

What if a single technique could improve latency, especially ITL, by a factor of two or three? Meet speculative decoding, a novel approach that is particularly useful for long, reasoning-heavy generations.
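The general draft-and-verify idea behind speculative decoding can be sketched with two toy "models". This is a minimal illustration under stated assumptions, not the book's implementation: the function names and toy models are hypothetical, the verification uses simple greedy exact matching (production systems typically use a probabilistic acceptance rule), and the per-position target calls stand in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verify passes)."""
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < n:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks every proposed position.
        #    (Counted as one pass: a real system batches these checks.)
        verify_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # keep the target's correction
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n], verify_passes


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees with it except right after the token 6.
def toy_target(tokens: List[Token]) -> Token:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[Token]) -> Token:
    return 0 if tokens[-1] == 6 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [0], n=12)
    print(out, "target passes:", passes)
```

With greedy verification the output is identical to decoding with the target model alone; the potential saving comes from needing fewer target passes than generated tokens whenever the draft model guesses correctly.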

In a large ML system with, say, a million data points for retrieval or recommendation, it’s common to use a small but less accurate model to do the first round of filtering. Once you’ve significantly reduced the number of candidate data points to around 1,000, you then apply ...

ISBN: 9798341621480