Chapter 3. Model Serving System Design: A Deep Dive
In Chapter 1, we introduced the major model serving paradigms, outlining common architectural patterns and trade-offs. In Chapter 2, we examined how LLMs perform inference and generate text at the model level. This chapter bridges those foundations to production engineering: how to organize code and infrastructure to construct complete serving systems for both single-model and multi-model scenarios.
Model serving is a rapidly evolving field, with hundreds of open-source serving frameworks and commercial solutions available. Evaluating, adopting, and customizing the right solution can quickly become overwhelming. Rather than starting with a specific framework, we focus in this chapter on building intuition from first principles. By understanding how serving systems are structured at a fundamental level, you’ll be better equipped to reason about any framework or managed service.
To that end, we develop two simplified yet representative serving systems: one for single-model LLM serving and one for multi-model serving. These implementations are intentionally streamlined and are not meant to replace production frameworks like Triton or vLLM, but they capture the core components and architectural decisions that define real-world systems. Through these examples, you will see how batching, streaming, routing, isolation, and resource management fit together in practice.
We begin by constructing a single-model LLM serving service that ...