Hands-On Multimodal AI
Published by O'Reilly Media, Inc.
Build applications with text, image, audio, and video
What you’ll learn and how you can apply it
- Understand how multimodal AI systems extend the transformer architecture across text, vision, and audio
- Prepare and format multimodal inputs for LVLMs and audio foundation models
- Implement practical multimodal tasks such as video question answering, speech transcription, and text extraction from visual content
- Develop real-world workflows that integrate multimodal understanding
- Evaluate and compare model outputs across modalities
Course description
Current state-of-the-art open source models can process images, videos, and text within a single framework and return structured answers. They can also extract information from videos or screenshots, identify what is happening in a scene, and predict the sentiment or age of a speaker in an audio or video recording. In this course, expert Nicole Koenigstein helps you gain the skills to leverage the full spectrum of modern multimodal AI for your applications.
First, Nicole will explain how the concept of text tokenization extends to images, audio, and video and how to format prompts that combine different media types. Then, through hands-on examples with LVLMs and audio foundation models, you’ll apply techniques for video question answering, audio classification, speech transcription, and video analysis and text extraction. You’ll also learn how to build real-world workflows such as meeting transcription with automatic speaker identification and segmentation; multimodal retrieval and video understanding; and multimodal retrieval-augmented generation (RAG) that integrates tables and images within text.
This live event is for you because...
- You’re an AI or ML engineer who wants to extend your language-model knowledge to multimodal systems.
- You’re a data scientist or technical researcher exploring models that combine vision, audio, and text understanding.
- You’re a developer or applied scientist who’s building real-world solutions for document analysis, transcription, content moderation, or intelligent multimodal assistants.
- You’re a technical lead who wants a clear understanding of what current multimodal models can and can’t do.
Prerequisites
- A Python 3.12 environment (Google Colab recommended)
- Dependencies from the provided GitHub repository installed on your computer
- A Hugging Face access token
- Colab Pro to run the open source models in a Jupyter notebook
- Intermediate-level Python programming experience (classes, functions, loops)
- Basic understanding of large language models
- Familiarity with the Hugging Face Transformers framework
Recommended follow-up:
- Read chapters 4, 5, 6, and 11 in Transformers: The Definitive Guide (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Multimodal foundations and tokenization (60 minutes)
- Presentation: What makes a model multimodal; how the transformer architecture extends from text to images, audio, and video; understanding tokenization across modalities and why it matters
- Demonstration: Converting text, images, and audio samples into tokens and embeddings, showing how different data types align in the same transformer space
- Hands-on exercise: Run a short Colab notebook that tokenizes an image and an audio clip
- Q&A
- Break
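As a rough illustration of how tokenization extends from text to images, the sketch below splits an image into fixed-size patches and flattens each one, which is the step that typically precedes embedding in vision transformers. The function name `patchify` and the toy 4x4 image are hypothetical, and real models add a learned projection and position information on top of this.

```python
def patchify(image, patch_size):
    """Split a 2D grid of pixel values into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 grayscale "image" with pixel values 0..15 (hypothetical data).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
```

Each flattened patch plays the role of one "visual token". By the same arithmetic, a 224x224 image cut into 16x16 patches yields 14 x 14 = 196 patches.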
Working with images and videos (45 minutes)
- Presentation: How LVLMs process visual sequences; frame extraction, visual context windows, and how prompts combine visual and textual information
- Demonstration: Asking a model about images or videos (for example, “How many bottles of drinks does the woman pick up?”)
- Hands-on exercise: Apply video and image question answering and information extraction from short screen recordings and images
- Q&A
- Break
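Frame extraction for video question answering often starts with choosing which frames to keep so the clip fits the model's visual context window. The sketch below shows one simple strategy, uniform sampling, under the assumption that the frame count is already known. The helper `sample_frame_indices` is hypothetical and is not the method of any particular model or library.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick evenly spaced frame indices, one from the middle of each segment."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int((i + 0.5) * segment) for i in range(num_frames)]


indices = sample_frame_indices(total_frames=100, num_frames=4)
print(indices)  # [12, 37, 62, 87]
```

Sampling from segment midpoints avoids always picking the very first and last frames, which are often fades or title cards. The selected indices would then be passed to whatever video decoder is in use.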
Understanding audio and speech (45 minutes)
- Presentation: Converting an audio sample into a token sequence through spectrograms and mel-frequency features, and why this matters for the model; how the sampling rate affects model accuracy
- Demonstration: Classifying different environmental sounds and predicting speaker attributes, such as emotion or age, using an audio foundation model
- Hands-on exercise: Extract the audio track from a video file for audio Q&A, which reduces memory usage when visual data is not required
- Q&A
- Break
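To make the link between sampling rate, spectrogram frames, and the mel scale concrete, here is a minimal sketch of the arithmetic involved. It assumes a 25 ms analysis window with a 10 ms hop and uses one widely used mel-scale formula. The function names are hypothetical, and a given audio model may use different settings.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common formula variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def num_frames(num_samples, sample_rate, window_ms=25, hop_ms=10):
    """Count full analysis windows that fit in a clip (no padding assumed)."""
    window = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


# One second of audio sampled at 16 kHz.
print(num_frames(16000, 16000))  # 98
print(round(hz_to_mel(1000)))    # 1000
```

Each frame becomes one column of the spectrogram, so the frame count sets the length of the sequence the model sees. The sampling rate also caps the highest frequency that can be represented at half its value, which is one reason a model generally expects audio resampled to the rate it was trained on.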
Building real-world multimodal workflows (45 minutes)
- Presentation: Integrating multiple modalities into end-to-end tasks such as meeting transcription and speaker identification
- Demonstration: Preparing an audio recording so that speaker segments can be extracted automatically and converted into text
- Hands-on exercise: Use an instruction-following audio-text model to extract speaker segments and text from a meeting recording, then perform automatic transcription, summarization, and Q&A over the audio
- Q&A
- Break
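Once a model has produced speaker-labeled segments, a small post-processing step usually turns them into a readable transcript. The sketch below merges consecutive segments from the same speaker. The segment format (dictionaries with `speaker`, `start`, `end`, and `text` keys), the `max_gap` threshold, and the sample data are all assumptions made for illustration.

```python
def merge_speaker_turns(segments, max_gap=1.0):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        last = merged[-1] if merged else None
        same_turn = (
            last is not None
            and last["speaker"] == seg["speaker"]
            and seg["start"] - last["end"] <= max_gap
        )
        if same_turn:
            last["end"] = max(last["end"], seg["end"])
            last["text"] += " " + seg["text"]
        else:
            merged.append(dict(seg))  # copy so the input is not mutated
    return merged


# Hypothetical model output for a short meeting excerpt.
segments = [
    {"speaker": "A", "start": 0.0, "end": 2.0, "text": "Welcome everyone."},
    {"speaker": "A", "start": 2.3, "end": 4.0, "text": "Let's begin."},
    {"speaker": "B", "start": 4.5, "end": 6.0, "text": "Thanks, I have an update."},
]
for turn in merge_speaker_turns(segments):
    print(f'[{turn["start"]:.1f}-{turn["end"]:.1f}] {turn["speaker"]}: {turn["text"]}')
```

The merged turns can then be passed as context for summarization or question answering over the meeting.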
Retrieval-augmented generation with multimodal inputs (45 minutes)
- Presentation: Extending RAG to multimodal contexts to incorporate tables, charts, and images
- Demonstration: Retrieving and combining text and images to answer complex queries using a multimodal RAG pipeline
- Hands-on exercise: Extract structured data from tables in a PDF and ask questions about its content
- Q&A
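The retrieval step of a multimodal RAG pipeline can be sketched as a nearest-neighbor search over a single index that holds text chunks, tables, and images side by side. In the example below the embeddings are hand-written three-dimensional vectors standing in for the output of a multimodal encoder. The index layout and the `retrieve` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, top_k=2):
    """Return the top_k index items most similar to the query vector."""
    scored = [(cosine(query_vec, item["embedding"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


# Toy index mixing modalities; embeddings are made up for illustration.
index = [
    {"id": "para-1", "kind": "text", "embedding": [0.9, 0.1, 0.0]},
    {"id": "table-2", "kind": "table", "embedding": [0.1, 0.9, 0.1]},
    {"id": "fig-3", "kind": "image", "embedding": [0.2, 0.8, 0.3]},
]
query = [0.0, 1.0, 0.2]
results = retrieve(query, index, top_k=2)
print([(r["id"], r["kind"]) for r in results])
```

With these toy vectors the table and the figure rank above the paragraph, so both would be handed to the generator as context alongside the question.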
Your Instructor
Nicole Koenigstein
Nicole Koenigstein is an independent data scientist and quantitative researcher as well as an AI consultant, leading workshops and guiding companies from AI concept to deployment. Previously, she was CEO and co-chief AI officer at Quantmate. Nicole is the author of the books Mathematics for Machine Learning with NLP and Python and Transformers in Action (Manning) and the forthcoming books AI Agents: The Definitive Guide and Transformers: The Definitive Guide for O’Reilly. She shares her expertise in Python, machine learning, and deep learning as a guest lecturer at various universities.
Skills covered
- Generative AI
- Retrieval-Augmented Generation (RAG)
- Video Production