Chapter 10. Optimizing AI Services
In this chapter, you’ll learn to further optimize your services via prompt engineering, model quantization, and caching mechanisms.
Optimization Techniques
Optimizing an AI service has two broad objectives: improving output quality and improving performance (latency, throughput, cost, and so on).
Performance-related optimizations include the following:
- Using batch processing APIs
- Caching (keyword, semantic, context, or prompt)
- Model quantization
Quality-related optimizations include the following:
- Using structured outputs
- Prompt engineering
- Model fine-tuning
Let’s review each in more detail.
Batch Processing
Often you want an LLM to process batches of entries at the same time. The most obvious solution is to submit a separate API call for each entry. However, this approach can be costly and slow, and may lead your model provider to rate-limit you.
In such cases, you can leverage two separate techniques for batch processing your data through an LLM:
- Updating your structured output schemas to return multiple examples at ...
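As a hedged sketch of that first technique: rather than one API call per entry, you can pack all entries into a single prompt and define a schema that returns one result per entry. The helpers below (`build_batch_prompt`, `parse_batch_response`) are illustrative names, and the actual model call is omitted; how you enforce the schema depends on your provider's structured-output support.

```python
import json

def build_batch_prompt(reviews: list[str]) -> str:
    # Pack every entry into one prompt, numbered so the model can
    # reference each input in its structured response.
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(reviews))
    return (
        "Classify the sentiment of each review as positive or negative.\n"
        "Return a JSON array of objects with fields 'id' and 'sentiment'.\n\n"
        f"Reviews:\n{numbered}"
    )

def parse_batch_response(raw: str, expected: int) -> list[dict]:
    # Parse the model's JSON array and validate that it contains exactly
    # one result per input entry.
    results = json.loads(raw)
    if len(results) != expected:
        raise ValueError(f"expected {expected} results, got {len(results)}")
    return results
```

One batched call like this amortizes per-request overhead across all entries, at the cost of needing to validate that the model returned a complete, well-formed array.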