Chapter 6. Essential LLM Optimization Techniques
In prior chapters, we demonstrated the importance and the challenges of optimizing LLMs for serving. In the next two chapters, we will dive deep into each of the critical LLM optimization techniques one by one so that you are equipped with the knowledge to decide when, how, and why to use them for your serving needs.
In this chapter specifically, we will focus on the essential techniques that underpin most optimization concepts and that will help you achieve many of your optimization goals. We’ll leave the more advanced techniques and industry trends for Chapter 7.
In this chapter, we will discuss how to use:
- Request batching and scheduling to achieve better parallelism and GPU utilization
- Attention optimization to achieve better compute efficiency, less required compute, and better memory management
- Model compression to achieve smaller models, less memory movement, and/or less compute
- Prefix caching to cache and reuse prior prompts, including how to do it efficiently and obtain a high cache-hit rate
Request Batching and Scheduling-Level Optimizations
In Chapter 2, we divided serving into offline serving and real-time online serving. In real-time online serving, requests arrive as users send them. In offline serving, we already have all the requests, so instead of sending them one by one, we can batch them together into one large input tensor that is fed into the model.
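To make the idea of forming "a big tensor input" concrete, here is a minimal sketch of offline batching in plain Python. It assumes requests have already been tokenized into lists of integer token IDs, and it right-pads them to a common length so they form a rectangular batch. The name `pad_batch`, the `PAD_ID` value, and the accompanying attention mask are illustrative assumptions, not an API from this book or from any particular serving framework.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token ID; real tokenizers define their own


def pad_batch(
    requests: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-ID sequences to the longest one and build a 0/1 mask.

    Returns (input_ids, attention_mask), both of shape
    [num_requests][max_len], as nested Python lists.
    """
    if not requests:
        return [], []
    max_len = max(len(r) for r in requests)
    # Pad every request with pad_id so all rows have the same length.
    input_ids = [r + [pad_id] * (max_len - len(r)) for r in requests]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(r) + [0] * (max_len - len(r)) for r in requests
    ]
    return input_ids, attention_mask


# Three already-tokenized offline requests of different lengths.
queued = [[11, 12, 13], [21], [31, 32]]
input_ids, attention_mask = pad_batch(queued)
```

In a real system the nested lists would be converted to a framework tensor before the forward pass. Note that padding costs compute on tokens that carry no information, which is one reason batching and scheduling strategies matter.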
Grouping requests together ...