Chapter 9. LLM Optimization in Practice
Optimization is a moving target: the “best” strategy changes from one environment to the next, and limited resources mean you can’t brute-force every option. To help you optimize efficiently for your own domain, we’ve chosen real examples that show how key factors (hardware configuration, model choice, memory and KV-cache behavior, distributed serving, and traffic patterns) affect serving performance, and how to measure and interpret those differences. This understanding will give you the intuition to navigate your constraints and converge on a strong serving setup.
In this hands-on chapter, we put everything you learned in the previous chapters to work. Using the open source Qwen3-14B model with vLLM as our running example, we’ll walk you through a practical LLM serving optimization process and show how to scale serving both horizontally and vertically.
We’ll start with a concise optimization plan and execute it step by step—setting up the environment, preparing the evaluation workload, running experiments, serving the model on both single- and multi-GPU setups, analyzing results, and applying the techniques introduced earlier. We’ll conclude with a set of battle-tested takeaways and trade-off recommendations drawn from our own experience, which can serve as guiding principles for your future optimization work.
After reading this chapter, you’ll have a clear picture of how LLM serving optimization is done in practice—and the confidence ...