Chapter 4. Model Serving Best Practices
In Chapters 2 and 3, we explored model inference, from implementation to system design. You saw how LLMs execute internally and how to build a serving system from first principles. This chapter shifts the focus from how to build such a system to how serving systems must evolve in real-world LLM applications.
Modern LLM applications rarely consist of a single request–response model invocation. Instead, models are embedded inside agentic workflows, enterprise platforms, and layered production systems. When this happens, model serving stops being just an inference problem—it becomes a system architecture problem. This chapter examines what changes at that system level.
We begin with agentic applications—not because this is an “agent chapter,” but because agents are now the primary pattern for building LLM-powered systems. Most modern LLM use cases—knowledge assistants, copilots, workflow automation, reasoning engines—follow an agent-like structure. A single user interaction may trigger multiple LLM calls, retrieval steps, tool execution, and iterative reasoning. These behaviors fundamentally reshape serving requirements.
Agents increase token usage, amplify tail latency across chained calls, introduce dynamic compute patterns, and require orchestration across models and tools. Many of the serving optimizations discussed later in this book—caching strategies, batching approaches, memory management, scheduling, and parallelism—are motivated ...
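To make the point about tail latency across chained calls concrete, here is a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that the calls in one agent interaction run sequentially and that their latencies are independent; real workloads are often correlated, so treat the numbers as intuition rather than a prediction. The function name `prob_any_slow` is hypothetical and not taken from the book.

```python
def prob_any_slow(n_calls: int, per_call_quantile: float = 0.99) -> float:
    """Probability that at least one of n_calls exceeds its per-call quantile latency.

    Assumption: call latencies are independent, so the chance that every call
    stays under its own p99 is per_call_quantile ** n_calls.
    """
    if n_calls < 0:
        raise ValueError("n_calls must be non-negative")
    if not 0.0 <= per_call_quantile <= 1.0:
        raise ValueError("per_call_quantile must be in [0, 1]")
    return 1.0 - per_call_quantile ** n_calls


if __name__ == "__main__":
    # Compare a single request-response call with longer agent chains.
    for n in (1, 5, 10, 20):
        share = prob_any_slow(n)
        print(f"{n:>2} chained calls: {share:.1%} of interactions hit a p99-slow call")
```

Under these assumptions, a latency that only 1 in 100 single calls experiences becomes far more common once an interaction chains many calls, which is one reason per-call latency targets alone do not describe the experience of an agentic workload.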