Multimodal AI Essentials: Merging Text, Image, and Audio for Next-Generation AI Applications
on-demand course


with Sinan Ozdemir
April 2025
Intermediate
5h 33m
English
Pearson
Closed Captioning available in English

Overview

5+ Hours of Video Instruction

This course equips you with the knowledge and skills needed to implement multimodal AI systems.

Multimodal AI Essentials: Merging Text, Image, and Audio for Next-Generation AI Applications shows you how combining modalities such as text, audio, video, and images enables AI systems to achieve remarkable capabilities. You will gain hands-on experience building visual question answering (VQA) models, generating personalized images with diffusion models, designing end-to-end multimodal applications, and even fine-tuning multimodal models for specific tasks. This course gives you the tools, knowledge, and confidence to design and deploy your own state-of-the-art multimodal AI systems.


Learn How To

  • Apply multimodal AI concepts
  • Build a voice-to-voice app
  • Apply visual question answering (VQA) concepts and architecture
  • Construct, fine-tune, and evaluate diffusion models with DreamBooth
  • Fine-tune a text-to-speech model with SpeechT5
  • Build visual agents from the ground up
  • Evaluate the performance of multimodal models
  • Extend multimodal systems with advanced techniques like computer use

About the Instructor

Sinan Ozdemir is the founder and CTO of LoopGenius, where he uses state-of-the-art AI to help people create and run their businesses. Sinan is a former lecturer of Data Science at Johns Hopkins University and the author of multiple textbooks and videos on data science and machine learning. Additionally, he is the founder of the recently acquired Kylie.ai, an enterprise-grade conversational AI platform with RPA capabilities. He holds a master's degree in Pure Mathematics from Johns Hopkins University and is based in San Francisco, California.

Who Should Take This Course

  • Developers, data scientists, and engineers who are interested in building intelligent, autonomous multimodal AI systems capable of solving complex problems and adapting to dynamic environments

Course Requirements

  • Python 3 proficiency, with some experience working in interactive Python environments, including notebooks (Jupyter, Google Colab, or Kaggle Kernels)
  • Comfort using the Pandas library and either TensorFlow or PyTorch
  • Understanding of ML/deep learning fundamentals including train/test splits, loss/cost functions, and gradient descent

Lesson Descriptions

Lesson 1: Introduction to Multimodal AI

Lesson 1 lays the groundwork for the course by introducing the core concepts of multimodal AI and its applications. It explores the significance of combining modalities like text, images, and audio to unlock a new frontier in AI development. By the end of this lesson, you will understand the transformative potential of multimodal AI systems and their impacts across industries.

Lesson 2: Building Visual Question Answering (VQA) Models

In Lesson 2, you dive with Sinan into the intricacies of constructing visual question answering (VQA) systems: models capable of answering questions about images. Through examples and architectural walkthroughs, you learn how to embed and fuse the image and text modalities effectively, gaining real insight into the applications of VQA.
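The embed-and-fuse step can be sketched in a few lines. The following is a minimal "late fusion" illustration that uses random NumPy vectors in place of real vision and text encoders; the embedding dimensions, the single linear layer, and the four candidate answers are assumptions made for illustration, not the course's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings standing in for a vision encoder and a text encoder;
# the dimensions here are arbitrary choices for illustration.
image_emb = rng.standard_normal(512)   # encoded image
text_emb = rng.standard_normal(256)    # encoded question

# Simple late fusion: concatenate the two modality embeddings and
# project the result into a shared answer space with one linear layer.
W = rng.standard_normal((4, 512 + 256)) * 0.01  # 4 hypothetical candidate answers
fused = np.concatenate([image_emb, text_emb])
logits = W @ fused

# Softmax over the candidate answers gives a probability per answer.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

Real VQA models replace the random vectors with pretrained encoders and often use richer fusion (cross-attention rather than concatenation), but the pattern of embedding each modality and combining them into one prediction space is the same.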

Lesson 3: Exploring Diffusion Models

Lesson 3 introduces diffusion models, a groundbreaking approach to image generation. Unlike traditional methods, diffusion models iteratively refine noisy images to create coherent outputs. The lesson explores the theory behind both the forward (noise-adding) process and the backward (denoising) process. You also fine-tune your own diffusion model using a technique known as DreamBooth.
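The forward corruption process has a convenient closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of the per-step noise schedule. A minimal NumPy sketch, assuming a standard linear beta schedule (the specific schedule values and step count are illustrative defaults, not taken from the course):

```python
import numpy as np

rng = np.random.default_rng(42)

# Linear noise schedule: beta_t controls how much noise is added at step t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product: alpha-bar_t

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form, without looping over steps:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((8, 8))   # stand-in for a clean image
x_early = forward_diffuse(x0, 10)  # barely corrupted: still close to x0
x_late = forward_diffuse(x0, 999)  # nearly pure Gaussian noise
```

Training a denoiser then amounts to predicting `eps` from `x_t` and `t`; generation runs the learned backward process from pure noise down to t = 0.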

Lesson 4: Developing Multimodal AI Systems

Lesson 4 focuses on the practical aspects of designing and implementing multimodal AI applications. From fine-tuning text-to-speech models to building your own visual agent, the lesson demonstrates how to create cohesive systems that handle diverse input and output modalities.

Lesson 5: Evaluating and Testing Multimodal AI Systems

Lesson 5 covers evaluation metrics, benchmarks, and the ethical considerations involved in testing multimodal AI systems. It also discusses bias mitigation and responsible AI practices, covering topics such as LLMs as multimodal judges and the proliferation of deepfakes.
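The LLM-as-judge pattern typically works by prompting a strong model with a grading rubric and parsing a structured score from its reply. A hedged sketch of that pattern follows; the prompt wording, the `Score: N` reply format, and the helper names are all hypothetical, and no judge model is actually called here.

```python
import re

def build_judge_prompt(question, image_caption, model_answer):
    """Construct a rubric-style prompt asking an LLM to grade a VQA answer.
    (Illustrative only: the judge model itself is not invoked.)"""
    return (
        "You are grading a visual question answering system.\n"
        f"Image (described): {image_caption}\n"
        f"Question: {question}\n"
        f"Model answer: {model_answer}\n"
        "Rate the answer's correctness from 1 to 5 and reply as 'Score: N'."
    )

def parse_score(judge_reply):
    """Extract the 1-5 score from the judge's reply; None if it is missing."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

prompt = build_judge_prompt("How many dogs are in the photo?",
                            "Two dogs playing in a park.", "two")
score = parse_score("Score: 5 - the answer matches the image.")
```

Forcing a machine-parseable reply format and handling the no-score case are what make this pattern usable as an automated metric; bias in the judge model itself is one of the concerns the lesson raises.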

Lesson 6: Expanding and Applying Multimodal AI

Lesson 6 explores advanced techniques and future trends in multimodal AI. You will see how to extend existing AI systems with cutting-edge methods, integrating novel data types. The lesson also anticipates the direction of this rapidly evolving field and its future applications, including computer use for generalized, agentic AI behavior.

About Pearson Video Training

Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Sams, and Que. Topics include IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video training at http://www.informit.com/video.


Publisher Resources

ISBN: 9780135418536