Inference Optimization for LLMs with vLLM

Advanced

Techniques for making large language models faster and easier to use

What you’ll learn and how you can apply it

Understand the principles and applications of different LLM decoding approaches
Explore core LLM inference bottlenecks and memory constraints
Discover how to accelerate model performance using advanced attention mechanisms, strategic decoding methods, and iterative supervised fine-tuning
Explore popular libraries to effectively deploy and manage large models in production

Course description

As large language models (LLMs) have become more widespread, there is a growing emphasis on using them effectively in production. In particular, speeding up inference has emerged as a key concern, as it has been a bottleneck for production deployment use cases. Over the past two years, various strategies—ranging from specialized attention mechanisms to advanced decoding approaches—have been proposed. The choice of inference server and specific deployment constraints can also shape training methods and recipes.

Join expert Isha Chaturvedi to get an in-depth look at how to optimize LLM inference and reduce latency with state-of-the-art techniques that participants can immediately apply to real-world projects.

This live event is for you because...

You’re an AI/ML engineer or enthusiast who seeks deeper knowledge of LLM inference.
You’re an engineer who’s optimizing LLM inference and deploying in production.
You want to learn practical, state-of-the-art methods for LLM speed and reliability.

Prerequisites

Access to Google Colab, with Python and NumPy installed
Intermediate knowledge of neural networks and transformers
Hands-on Python knowledge (including reading/writing code, basic knowledge of NumPy, Transformers, and the PyTorch library)
An understanding of essential concepts such as loss functions and model freezing

Recommended preparation:

(Optional) Take Fine-tuning Open-Source Large Language Models (live online course with Christian Winkler)

Recommended follow-up:

Read Designing Large Language Model Applications (book)
Read Hands-On Large Language Models (book)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction to large language model inference (70 minutes)

Presentation: Major challenges (large model sizes, memory/compute constraints, and long-context inputs); tokenization basics; prefill phase versus decode phase; key-value (KV) caching and its memory implications; batching; LLM memory requirements
Group discussion: Real-world deployment setups (batch sizes, concurrency, memory constraints); when decode-time overhead becomes problematic
Q&A
Break
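
The memory implications of key-value caching listed above can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes` and the model dimensions (32 layers, 128-dimensional heads, 4,096-token context, 16-bit values) are assumptions chosen for the example, not figures from the course. It compares full multi-head attention with a grouped-query variant that keeps fewer key/value heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values
    for every layer, KV head, and token position.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Hypothetical configuration: 32 layers, head_dim 128, 4096 tokens, fp16.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"Multi-head attention (32 KV heads): {mha_bytes / GIB:.2f} GiB per sequence")
print(f"Grouped-query attention (8 KV heads): {gqa_bytes / GIB:.2f} GiB per sequence")
```

Because the estimate scales linearly with batch size and sequence length, the cache can grow quickly as concurrency or context length increases, which is one reason the number of key/value heads matters at serving time.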

Scaling LLM inference with efficient attention and parallelism (65 minutes)

Presentation: Advanced attention mechanisms and memory management (multi-head, multi-query, grouped-query attention; flash attention and paged attention)
Group discussion: Model parallelization techniques (pipeline, tensor, and sequence parallelism)
Hands-on exercise: Code simplified flash and paged attention and benchmark performance differences across different sequence lengths in PyTorch
Q&A
Break
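
As a rough illustration of the idea behind flash attention named in this session, the sketch below processes keys and values block by block while keeping a running softmax (the "online softmax" trick), so the full score vector never needs to be held at once. It is a minimal pure-Python sketch for a single query vector, not the actual FlashAttention kernel and not the course's exercise code; the function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute all scores, apply softmax, then average the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # largest score seen so far
    running_sum = 0.0                # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])     # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

The two functions should agree up to floating-point rounding for any block size. A real implementation applies the same rescaling to tiles of queries and keys held in fast on-chip memory, which is where the speedup comes from.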

Efficient decoding and model optimization for real-world inference (65 minutes)

Presentation: Speculative and constrained decoding, multi-token decoding, quantization, sparsity, and distillation; inference libraries and serving solutions (vLLM, Text Generation Inference, NVIDIA Dynamo); inference systems (Groq, SambaNova Systems, etc.)
Q&A
Break

Hands-on inference with vLLM and DeepSeek-V3 (40 minutes)

Hands-on exercise: Explore the vLLM library by running DeepSeek-V3 with vLLM in Google Colab
Group discussion: Practical issues that can come up during use of inference libraries
Q&A

Your Instructor

Isha Chaturvedi
Isha Chaturvedi is an applied scientist on the Amazon AGI team. An AI researcher with over six years of industry experience in machine learning, she specializes in natural language understanding (NLU) and GenAI, and holds multiple patents in the NLU domain. She has worked at multiple companies, including Articul8, Capital One, and Ericsson, and is active in open source projects. She’s an author of the Maya paper (an instruction-finetuned multilingual multimodal model) with Cohere for AI. She has also contributed to research at the Urban Observatory and Sounds of New York City labs at New York University, where she earned her master’s degree in urban data science. She completed her undergraduate studies in environmental technology and computer science at the Hong Kong University of Science and Technology (HKUST). As a research assistant at the HKUST-Deutsche Telekom Systems and Media Lab, she worked on augmented reality and computer vision. Isha has also served as an advisory board member at the University of California, Riverside.

Skill covered

Large Language Models (LLMs)
