Compressing Knowledge
As a result of the large language model boom, there has been significant innovation in techniques for model compression at training and inference time. The two most prominent examples of this are quantization and low-rank adaptation (LoRA).
Quantization is the process of converting a continuous range of values to a finite range of values. Quantization has been around for a very long time. It’s fundamental to the fields of digital signal processing. In the context of machine learning, quantization refers to the quantization of machine learning model weights. Models are typically trained in 32-bit, 16-bit (half), or mixed (both) precision. Post-training quantization is the process of taking these 32-bit or 16-bit model weights ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access