Hands-On LLM Serving and Optimization

by Chi Wang, Peiheng Hu

April 2026

Intermediate to advanced

374 pages

11h 17m

English

O'Reilly Media, Inc.

Includes

Quizzes

Foreword
Preface
- Why LLM Serving and Optimization?
- What This Book Aims to Do
- Who Should Read This Book
- What This Book Isn’t
- How This Book Is Organized
- How to Use This Book
- What You’ll Need
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. Introduction to Model Serving and Optimization
- Anatomy of a Model
- Model Architecture
- Model Data
- Model Execution Code
- Model Lifecycle: From Training to Serving
- What Is Model Serving?
- Why Study Model Serving?
- Why Optimize Model Serving (Especially for LLMs)?
- Example: Using a Model Serving Framework (vLLM) to Improve LLM Throughput
- Model Serving Paradigms
- On-Device (Edge) Serving
- Single-Model Service
- Multi-Model Service
- Model Serving Platforms
- Summary
2. Large Language Model Serving
- Inside the Mind of a Transformer
- LLM Evolution
- The Autoregressive Nature of Transformers
- Decoder-Only Transformer Architecture
- Capture Token Context by Calculating Attention
- Executing LLM Generation: A Step-by-Step Walkthrough
- Run the Qwen Model
- Model Prediction, Line by Line
- Enable the KV Cache to Boost Performance
- The Prefill and Decode Phases
- Run the LLM with a Serving Framework
- Serve the LLM (Qwen) with vLLM
- Performance Comparison: vLLM Versus Hugging Face Transformers
- LLM Streaming Serving Basics
- LLM Batch Serving Basics
- Summary
3. Model Serving System Design: A Deep Dive
- Build an Online LLM Serving Service from Scratch
- Design Goals
- Service Architecture
- Implement Single Generation Request Handling
- Batching
- Streaming with Batching
- Batch Serving with vLLM
- A General Design for Single-Model LLM Serving
- Requirements for Single-Model Serving
- General Design
- Build a Multi-Model Serving Service from Scratch
- Design Goals
- Service Architecture
- Core Implementation
- Using NVIDIA Triton as a Model Server
- Trade-offs in Multi-Model Serving Designs
- Challenges
- A Cost-Optimized Multi-Model Design
- A Latency-Optimized Multi-Model Design
- Summary
4. Model Serving Best Practices
- Model Serving in an Agentic World
- Defining Agents
- A Sample Knowledge Agent
- The Agent’s Design
- The Agent’s Internal Workflow
- Agent Autonomy
- Retrieval-Augmented Generation (RAG)
- Cache-Augmented Generation (CAG)
- How Agents Use Model Serving
- LLM Serving in Enterprise Systems: An Overview
- Public API Layer
- Resource Management Layer
- Model Selection and Orchestration Layer
- Distributed Serving Layer
- Core Inference Layer
- Model Optimization Layer
- Model Layer
- Building with an Open Source Stack
- Implementing Public API
- Implementing Model Selection
- Implementing a Model Serving Endpoint
- Building with a Cloud Vendor
- Option 1: Fully Managed Foundation-Model Serving
- Option 2: One-Click Foundation-Model Deployment
- Option 3: Bring Your Own Model
- Option 4: Bring Your Own Code
- Option 5: Bring Your Own Serving Image
- Option 6: Build Your Own Serving Infrastructure
- Comparing the Options
- Build or Buy? Understanding Strategies
- Why Knowing How to Build Helps—Even If You Won’t Build
- Our Selection Strategy
- Measuring Performance in LLM Serving
- Latency Metrics
- Throughput Metrics
- Best Practices for Performance Measurement
- Summary
5. Challenges When Serving LLMs
- Why Optimizing LLM Serving is Important
- Customer Experience
- Cost Efficiency
- Scalability, Peak Load Handling, and Feasibility
- The Role of Accelerator Chips in LLM Serving
- Reading GPU specs
- Comparing the Specs of Popular GPUs
- Bottlenecks in LLM Model Loading
- The Model Loading Process
- Estimating Model Size
- Estimating KV Cache Size
- Bottlenecks in LLM Model Execution
- Boundaries of GPU Compute and Memory Bandwidth
- Arithmetic Intensity in Matrix Multiplications
- Applying Arithmetic Intensity Analysis to the LLM Prefill and Decode Phases
- Other AI Accelerators and Trends
- Summary
6. Essential LLM Optimization Techniques
- Request Batching and Scheduling-Level Optimizations
- Why Do We Need Batching in Real-Time Serving?
- Dynamic Batching in Online Inference
- Continuous Batching for LLM Online Inference
- Continuous Batching with Chunked Prefill
- Scaling Attention and Kernel Optimization
- Scalable Attention Mechanisms
- Kernel Fusion and Custom Attention Kernels
- Model Compression
- Quantization
- Distillation
- Pruning
- Prefix Caching
- RadixAttention
- Use Cases
- Best Practices
- Scaling Prefix Cache
- Summary
7. Advanced LLM Optimization Techniques
- Speculative Decoding
- Detailed Steps
- Tuning and Usage
- Hands-on Speculative Decoding
- Multi-GPU and Multi-Node Inferencing
- Data Parallelism
- Tensor Parallelism and Pipeline Parallelism
- Expert Parallelism
- Prefill-Decode Disaggregation
- Overall Architecture
- KV Cache Transfer
- When to Use
- Advanced KV Caching
- Long-Context Serving
- Cost and Latency Calculations
- Self-Hosting LLMs
- Hands-on LMCache
- Summary
8. LLM Serving Frameworks
- Why We Need Specialized LLM Serving Frameworks
- vLLM
- vLLM’s Architecture
- Model Initialization Workflow (with Multi-Process Worker)
- Generation-Request Execution Workflow
- Scheduler Deep Dive
- vLLM’s Layered Optimization Strategy
- TensorRT-LLM
- SGLang
- Llama.cpp
- Selecting the Right Framework
- Summary

9. LLM Optimization in Practice
- LLM Serving Optimization Plan
- Optimize Qwen3-14B serving with vLLM
- Step 1: Examine the GPU hardware
- Step 2: Generate Benchmark Traffic
- Step 3: Define Evaluation Metrics
- Step 4: Set Up the Model Serving Server
- Step 5: Benchmark the Qwen3 Model with vLLM
- Step 6: Benchmark the Quantized Qwen3 Model with vLLM
- Step 7: Apply Additional Optimization Techniques
- Step 8: Benchmark the Qwen3 Model with Distributed Serving
- Common Optimization Trade-offs
- Summary
10. Advancements in LLM Serving
- Semantic Caching
- Performance Profiling Strategies
- Multimodal Serving
- Multimodal Input Processing
- Architectural and System Implications
- Edge AI: Drivers and Enablers
- Specialized Low-Power Hardware
- Model Compression and Optimization
- Heterogeneous Compute
- Thermal-Aware Scheduling
- Edge–Cloud Hybrid Compute
- Multi-LoRA Serving
- Model Serving in Reinforcement Learning
- LLM Serving in RL
- Determinism in RL Serving
- Summary
Index
About the Authors

Content preview from Hands-On LLM Serving and Optimization

Index

Symbols

2:4 structured sparsity, Pruning

A

“Accelerating Large Language Model Decoding with Speculative Sampling” (Chen et al.), Detailed Steps
accelerator chips, The Role of Accelerator Chips in LLM Serving-Comparing the Specs of Popular GPUs
- AI accelerators in system on a chip, Specialized Low-Power Hardware
- chips competing with NVIDIA, Other AI Accelerators and Trends-Other AI Accelerators and Trends
  - NVIDIA dominating the market, Other AI Accelerators and Trends
- data movement gains lagging compute power, Other AI Accelerators and Trends
- memory wall, Other AI Accelerators and Trends-Other AI Accelerators and Trends
- NVIDIA GPUs
  - dominating the market, Other AI Accelerators and Trends
  - LLM model execution, Bottlenecks in LLM Model Execution-Applying Arithmetic Intensity Analysis to the LLM Prefill and Decode Phases
  - LLM model loading, Bottlenecks in LLM Model Loading-Estimating KV Cache Size
- reading GPU specs, Reading GPU specs-GPU power consumption
  - comparing popular GPUs, Comparing the Specs of Popular GPUs
- trends, Other AI Accelerators and Trends
advancements in LLM serving
- edge AI, Edge AI: Drivers and Enablers-Edge–Cloud Hybrid Compute
  - edge–cloud hybrid compute, Edge–Cloud Hybrid Compute
  - heterogeneous compute, Heterogeneous Compute
  - model compression and optimization, Model Compression and Optimization
  - specialized low-power hardware, Specialized Low-Power Hardware
  - thermal-aware scheduling, Thermal-Aware Scheduling
- model serving in reinforcement learning, Model Serving in Reinforcement Learning
  - determinism ...

Publisher Resources

ISBN: 9798341621480
