Chapter 7. Lean Inference
If a tree falls in the forest and no one is around to hear it, does it make a sound? We have all encountered this thought experiment in different settings. For AI practitioners, the related question should be: if a model is developed and no one uses it, what happens to all the resources consumed in its development?
All AI models are developed with the hope that they will be used extensively. Not every model finds users, though. This raises a philosophical question about how to allocate resources at the outset when developing any resource-intensive technology.1 We will discuss this dilemma further in Chapter 9.
In this chapter, we will focus on the resource efficiency and sustainability of AI models at deployment. Technically, using a trained AI model to make predictions is known as inference. I first present an overview of the inference costs of modern AI models and then look at some effective methods for reducing them. Many of the methods discussed in Chapter 6 for improving training efficiency, such as quantization and neural network pruning, can also be used to achieve lean inference. In addition, we will consider specialized methods that accelerate AI inference by translating high-level implementations into more efficient, lower-level programming languages such as C++.
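As a minimal sketch of how two of these ideas combine in practice (assuming PyTorch; the toy architecture, layer sizes, and file name are illustrative assumptions, not examples from this chapter), post-training dynamic quantization can compress a trained model, and a TorchScript export then lets the result run outside Python:

```python
import torch
import torch.nn as nn

# Stand-in for a trained model; the architecture and sizes here are
# illustrative assumptions, not a model from this chapter.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()  # inference mode: no gradients or training-only behavior needed

# Post-training dynamic quantization: weights of the listed module types
# are stored as int8 and dequantized on the fly, shrinking the model and
# often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# TorchScript compiles the model into a Python-independent artifact
# that can be loaded and executed from C++ via LibTorch.
scripted = torch.jit.script(quantized)
scripted.save("model_int8.pt")
```

The saved artifact can then be loaded in a C++ serving process with LibTorch's torch::jit::load, so predictions are served without a Python runtime on the critical path.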
Lifetime Cost of an AI Model
Consider the inference cost of a large GenAI model, such as Llama-3-405B. Let’s assume the energy consumption per prompt is ...
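To make the accounting concrete before plugging in numbers, here is the general form of the calculation, written with shorthand symbols of our own rather than the chapter's notation:

$$E_{\text{lifetime}} \approx E_{\text{training}} + N_{\text{prompts}} \cdot E_{\text{prompt}}$$

where $N_{\text{prompts}}$ is the total number of prompts served over the model's deployment lifetime and $E_{\text{prompt}}$ is the energy consumed per prompt. For a heavily used model, the inference term quickly dominates the one-time training cost, which is precisely why lean inference matters.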