Chapter 8. Addressing Constraints
Deploying LLMs in production environments presents a unique set of challenges that go far beyond simply getting a model to work. While LLMs offer remarkable capabilities, they also demand substantial computational resources, introduce latency concerns, and can quickly become cost-prohibitive at scale. The gap between a proof of concept that works on a single query and a production system serving thousands of users is often underestimated.
In this chapter, we provide patterns that address the concerns you're likely to face when deploying LLMs in production systems. Whether you're contending with hardware limitations, budget constraints, or strict latency requirements, the patterns in this chapter offer proven strategies for optimizing your LLM deployment.
We'll explore five key patterns that tackle different aspects of production constraints. The section on the Small Language Model (Pattern 24) shows you how to reduce computational overhead through model distillation and quantization. The section on Prompt Caching (Pattern 25) demonstrates how to eliminate redundant processing, reducing both cost and latency. The section on Optimizing Inference (Pattern 26) covers advanced techniques such as continuous batching and speculative decoding that maximize hardware utilization. The section on Degradation Testing (Pattern 27) provides the metrics you need to validate that your LLM-based application is performing well, and it also covers actions that you can take ...