Chapter 6. Tuning and Infrastructure
You built a capable customer service agent in Chapters 3 and 4. It handles requests across text, images, and video, and routes complex queries to specialist agents when needed. You followed Chapter 5’s evaluation framework, measuring performance, iterating on prompts, and refining coordination patterns. The system works.
But as you prepare for production, new questions emerge. What if response times need to drop by 50%? What if the request volume scales to millions per day? What if your domain vocabulary—the specific terminology and patterns unique to your business—proves too specialized for the base model to handle reliably? When prompt engineering and agent design reach their limits, what comes next?
This chapter explores the deeper interventions that become necessary at scale. We’ll examine when fine-tuning justifies its costs, how to implement it efficiently, and how to build inference infrastructure that balances performance, cost, and operational complexity.
The Tuning Decision
These questions about latency, scale, and domain specialization don’t all have the same answer. Some point toward fine-tuning. Others don’t. To see why, consider a financial services client whose agent struggled with two problems. First, it missed fraud signals, failing to recognize when routine-sounding inquiries were actually red flags for account takeover. The client had tried detailed prompts describing fraud patterns, but even with regular updates, the model ...