Inference Optimization for LLMs with vLLM
Published by O'Reilly Media, Inc.
Techniques for making large language models faster and easier to use
What you’ll learn and how you can apply it
- Understand the principles and applications of different LLM decoding approaches
- Explore core LLM inference bottlenecks and memory constraints
- Discover how to accelerate model performance using advanced attention mechanisms, strategic decoding methods, and iterative supervised fine-tuning
- Explore popular libraries to effectively deploy and manage large models in production
Course description
As large language models (LLMs) have become more widespread, there is a growing emphasis on using them effectively in production. In particular, speeding up inference has emerged as a key concern, because slow inference is often the bottleneck in production deployments. Over the past two years, various strategies, ranging from specialized attention mechanisms to advanced decoding approaches, have been proposed. The choice of inference server and specific deployment constraints can also shape training methods and recipes.
Join expert Isha Chaturvedi to get an in-depth look at how to optimize LLM inference and reduce latency with state-of-the-art techniques that participants can immediately apply to real-world projects.
This live event is for you because...
- You’re an AI/ML engineer or enthusiast who seeks deeper knowledge of LLM inference.
- You’re an engineer who’s optimizing LLM inference and deploying in production.
- You want to learn practical, state-of-the-art methods for LLM speed and reliability.
Prerequisites
- Access to Google Colab, with Python and NumPy installed
- Intermediate knowledge of neural networks and transformers
- Hands-on Python knowledge (including reading/writing code, basic knowledge of NumPy, Transformers, and the PyTorch library)
- An understanding of essential concepts such as loss functions and model freezing
Recommended preparation:
- (Optional) Take Fine-tuning Open-Source Large Language Models (live online course with Christian Winkler)
Recommended follow-up:
- Read Designing Large Language Model Applications (book)
- Read Hands-On Large Language Models (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Introduction to large language model inference (70 minutes)
- Presentation: Major challenges (large model sizes, memory/compute constraints, and long-context inputs); tokenization basics; prefill phase versus decode phase; key-value (KV) caching and its memory implications; batching; LLM memory requirements
- Group discussion: Real-world deployment setups (batch sizes, concurrency, memory constraints); when decode-time overhead becomes problematic
- Q&A
- Break
Scaling LLM inference with efficient attention and parallelism (65 minutes)
- Presentation: Advanced attention mechanisms and memory management (multi-head, multi-query, grouped-query attention; flash attention and paged attention)
- Group discussion: Model parallelization techniques (pipeline, tensor, and sequence parallelism)
- Hands-on exercise: Code simplified versions of flash attention and paged attention in PyTorch, and benchmark their performance across different sequence lengths
- Q&A
- Break
Efficient decoding and model optimization for real-world inference (65 minutes)
- Presentation: Speculative and constrained decoding, multi-token decoding, quantization, sparsity, and distillation; inference libraries and serving solutions (vLLM, Text Generation Inference, NVIDIA Dynamo); inference systems (Groq, SambaNova Systems, etc.)
- Q&A
- Break
Hands-on inference with vLLM and DeepSeek-V3 (40 minutes)
- Hands-on exercise: Explore the vLLM library by running DeepSeek-V3 with vLLM in Google Colab
- Group discussion: Practical issues that can come up during use of inference libraries
- Q&A
Your Instructor
Isha Chaturvedi
Isha Chaturvedi is an applied scientist on the Amazon AGI team. An AI researcher with over six years of industry experience in machine learning, she specializes in natural language understanding (NLU) and GenAI, and holds multiple patents in the NLU domain. She has worked at multiple companies, including Articul8, Capital One, and Ericsson, and is active in open source projects. She’s an author of the Maya paper (an instruction-finetuned multilingual multimodal model), with Cohere for AI. She has also contributed to research at the Urban Observatory and Sounds of New York City labs at New York University, where she earned her master’s degree in urban data science. She completed her undergraduate studies in environmental technology and computer science at the Hong Kong University of Science and Technology (HKUST). As a research assistant at the HKUST-Deutsche Telekom Systems and Media Lab, she worked on augmented reality and computer vision. Isha has also served as an advisory board member at the University of California, Riverside.