Hands-On Multimodal AI

Intermediate

Build applications with text, image, audio, and video

What you’ll learn and how you can apply it

Understand how multimodal AI systems extend the transformer architecture across text, vision, and audio
Prepare and format multimodal inputs for LVLMs and audio foundation models
Implement practical multimodal tasks such as video question answering, speech transcription, and text extraction from visual content
Develop real-world workflows that integrate multimodal understanding
Evaluate and compare model outputs across modalities

Course description

Current state-of-the-art open source models can process images, videos, and text within a single framework and return structured answers. They can also extract information from videos or screenshots, identify what is happening in a scene, and predict the sentiment or age of a speaker in an audio or video recording. In this course, expert Nicole Koenigstein helps you gain the skills to leverage the full spectrum of modern multimodal AI for your applications.

First, Nicole will explain how the concept of text tokenization extends to images, audio, and video and how to format prompts that combine different media types. Then, through hands-on examples with LVLMs and audio foundation models, you’ll apply techniques for video question answering, audio classification, speech transcription, and video analysis and text extraction. You’ll also learn how to build real-world workflows such as meeting transcription with automatic speaker identification and segmentation; multimodal retrieval and video understanding; and multimodal retrieval-augmented generation (RAG) that integrates tables and images within text.

This live event is for you because...

You’re an AI or ML engineer who wants to extend your language-model knowledge to multimodal systems.
You’re a data scientist or technical researcher exploring models that combine vision, audio, and text understanding.
You’re a developer or applied scientist who’s building real-world solutions for document analysis, transcription, content moderation, or intelligent multimodal assistants.
You’re a technical lead who wants a clear understanding of what current multimodal models can and can’t do.

Prerequisites

A Python 3.12 environment (Google Colab recommended)
Dependencies from the provided GitHub repository installed on your computer
A Hugging Face access token
Colab Pro to run the open source models in a Jupyter notebook
Intermediate-level Python programming experience (classes, functions, loops)
Basic understanding of large language models
Familiarity with the Hugging Face Transformers framework

Recommended follow-up:

Read chapters 4, 5, 6, and 11 in Transformers: The Definitive Guide (book)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Multimodal foundations and tokenization (60 minutes)

Presentation: What makes a model multimodal; how the transformer architecture extends from text to images, audio, and video; understanding tokenization across modalities and why it matters
Demonstration: Converting text, images, and audio samples into tokens and embeddings, showing how different data types align in the same transformer space
Hands-on exercise: Run a short Colab notebook that tokenizes an image and an audio clip
Q&A
Break
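
As a rough illustration of how tokenization can extend from text to images, the sketch below splits a tiny grayscale "image" into non-overlapping square patches and flattens each one, in the style of a ViT patch embedding. The patch-based approach, the function name `patchify`, and the 4x4 toy image are assumptions chosen for illustration. They are not taken from the course notebooks, and real models add a learned projection and position information on top of this step.

```python
def patchify(image, patch_size):
    """Split a 2D grid (a list of rows) into flattened, non-overlapping square patches.

    Each returned patch plays the role of one "visual token" before any
    learned projection is applied (illustrative assumption, ViT-style).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x4 image whose pixel values are just running indices.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), "patches of length", len(patches[0]))
```

With a patch size of 2, the 4x4 image becomes a sequence of four vectors with four values each, which is the sequence a transformer would then consume.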

Working with images and videos (45 minutes)

Presentation: How LVLMs process visual sequences; frame extraction, visual context windows, and how prompts combine visual and textual information
Demonstration: Asking a model about images or videos (for example, “How many bottles of drinks does the woman pick up?”)
Hands-on exercise: Apply video and image question answering and information extraction from short screen recordings and images
Q&A
Break
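
Frame extraction for video question answering often starts with choosing which frames to pass to the model. The sketch below picks evenly spaced frame indices from a clip. Uniform sampling, the function name `sample_frame_indices`, and the example frame counts are assumptions for illustration; the models used in the course may apply a different sampling strategy.

```python
def sample_frame_indices(total_frames, num_samples):
    """Return evenly spaced frame indices, one from the middle of each segment.

    Uniform sampling is an illustrative assumption, not necessarily the
    strategy used by any particular LVLM.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")

    # Never request more frames than the clip contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# Hypothetical 100-frame clip reduced to 4 frames for the visual context window.
indices = sample_frame_indices(total_frames=100, num_samples=4)
print(indices)
```

Sampling fewer frames shortens the visual token sequence, at the cost of possibly missing short events between the sampled frames.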

Understanding audio and speech (45 minutes)

Presentation: Converting an audio sample into a token sequence through spectrograms and mel-frequency features, and why this matters for the model; how the sampling rate affects model accuracy
Demonstration: Classifying different environmental sounds and predicting speaker attributes, such as emotion or age, using an audio foundation model
Hands-on exercise: Extract the audio track from a video file and run audio Q&A on it, which reduces memory usage when visual data is not required
Q&A
Break
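
To make the link between sampling rate and spectrogram-based audio tokens more concrete, the sketch below computes how many analysis frames a clip yields and the highest frequency the recording can represent (the Nyquist frequency, half the sampling rate). The 25 ms window and 10 ms hop are common defaults assumed here for illustration, and the function name `spectrogram_frame_info` is hypothetical. No padding is applied, so actual feature extractors may report slightly different frame counts.

```python
def spectrogram_frame_info(duration_s, sample_rate, window_ms=25.0, hop_ms=10.0):
    """Estimate the frame count and Nyquist frequency for a framed audio clip.

    window_ms and hop_ms are assumed defaults; no padding is applied.
    """
    num_samples = int(duration_s * sample_rate)
    window = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)

    if num_samples < window:
        num_frames = 0
    else:
        num_frames = 1 + (num_samples - window) // hop

    return {"num_frames": num_frames, "nyquist_hz": sample_rate / 2}


# The same one-second clip at two hypothetical sampling rates.
info_16k = spectrogram_frame_info(duration_s=1.0, sample_rate=16000)
info_8k = spectrogram_frame_info(duration_s=1.0, sample_rate=8000)
print(info_16k, info_8k)
```

Because the window and hop are defined in milliseconds, both sampling rates produce the same number of frames, but the lower rate cannot represent content above 4 kHz.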

Building real-world multimodal workflows (45 minutes)

Presentation: Integrating multiple modalities into end-to-end tasks such as meeting transcription and speaker identification
Demonstration: Preparing an audio recording so that speaker segments can be extracted automatically and converted into text
Hands-on exercise: Use an instruction-following audio-text model to extract speaker segments and text from a meeting recording, then perform automatic transcription, summarization, and Q&A over the audio
Q&A
Break

Retrieval-augmented generation with multimodal inputs (45 minutes)

Presentation: Extending RAG to multimodal contexts to incorporate tables, charts, and images
Demonstration: Retrieving and combining text and images to answer complex queries using a multimodal RAG pipeline
Hands-on exercise: Extract structured data from tables in a PDF and ask questions about its content
Q&A

Your Instructor

Nicole Koenigstein
Nicole Koenigstein is an independent data scientist, quantitative researcher, and AI consultant who leads workshops and guides companies from AI concept to deployment. Previously, she was CEO and co-chief AI officer at Quantmate. Nicole is the author of the books Mathematics for Machine Learning with NLP and Python and Transformers in Action (Manning) and the forthcoming books AI Agents: The Definitive Guide and Transformers: The Definitive Guide for O’Reilly. She shares her expertise in Python, machine learning, and deep learning as a guest lecturer at various universities.

Skills covered

Generative AI

Retrieval Augmented Generation (RAG)
Video Production
