Multimodal AI Essentials: Merging Text, Image, and Audio for Next-Generation AI Applications
on-demand course


with Sinan Ozdemir
April 2025
Intermediate
5h 33m
English
Pearson
Closed Captioning available in English

Overview

5+ Hours of Video Instruction

This course equips you with the knowledge and skills needed to implement multimodal AI systems.

Multimodal AI Essentials: Merging Text, Image, and Audio for Next-Generation AI Applications shows you how combining modalities such as text, audio, video, and images enables AI systems to achieve remarkable capabilities. You will gain hands-on experience building visual question answering (VQA) models, generating personalized images with diffusion models, designing end-to-end multimodal applications, and even fine-tuning multimodal models for specific tasks. This course gives you the tools, knowledge, and confidence to design and deploy your own state-of-the-art multimodal AI systems.


Learn How To

  • Apply multimodal AI concepts
  • Build a voice-to-voice app
  • Apply visual question answering (VQA) concepts and architecture
  • Construct, fine-tune, and evaluate diffusion models with DreamBooth
  • Fine-tune a text-to-speech model with SpeechT5
  • Build visual agents from the ground up
  • Evaluate the performance of multimodal models
  • Extend multimodal systems with advanced techniques like computer use

About the Instructor

Sinan Ozdemir is the founder and CTO of LoopGenius, where he uses state-of-the-art AI to help people create and run their businesses. Sinan is a former lecturer of Data Science at Johns Hopkins University and the author of multiple textbooks and videos on data science and machine learning. Additionally, he is the founder of the recently acquired Kylie.ai, an enterprise-grade conversational AI platform with RPA capabilities. He holds a master's degree in Pure Mathematics from Johns Hopkins University and is based in San Francisco, California.

Who Should Take This Course

  • Developers, data scientists, and engineers who are interested in building intelligent, autonomous multimodal AI systems capable of solving complex problems and adapting to dynamic environments

Course Requirements

  • Python 3 proficiency, with some experience working in interactive Python environments, including notebooks (Jupyter, Google Colab, or Kaggle Kernels)
  • Comfort using the Pandas library and either TensorFlow or PyTorch
  • Understanding of ML/deep learning fundamentals including train/test splits, loss/cost functions, and gradient descent

Lesson Descriptions

Lesson 1: Introduction to Multimodal AI

Lesson 1 lays the groundwork for the course by introducing the core concepts of multimodal AI and its applications. It explores the significance of combining modalities like text, images, and audio to unlock a new frontier in AI development. By the end of this lesson, you will understand the transformative potential of multimodal AI systems and their impacts across industries.

Lesson 2: Building Visual Question Answering (VQA) Models

In Lesson 2, you dive with Sinan into the intricacies of constructing visual question answering (VQA) systems: models capable of answering questions about images. Through examples and architectural walkthroughs, you learn how to embed and fuse the image and text modalities effectively, gaining real insight into the applications of VQA.
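The embed-and-fuse step can be sketched in a few lines. The following is a minimal "late fusion" illustration that uses random NumPy vectors in place of real vision and text encoders; the embedding dimensions, the single linear layer, and the four candidate answers are assumptions made for illustration, not the course's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings standing in for a vision encoder and a text encoder;
# the dimensions here are arbitrary choices for illustration.
image_emb = rng.standard_normal(512)   # encoded image
text_emb = rng.standard_normal(256)    # encoded question

# Simple late fusion: concatenate the two modality embeddings and
# project the result into a shared answer space with one linear layer.
W = rng.standard_normal((4, 512 + 256)) * 0.01  # 4 hypothetical candidate answers
fused = np.concatenate([image_emb, text_emb])
logits = W @ fused

# Softmax over the candidate answers gives a probability per answer.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

Real VQA models replace the random vectors with pretrained encoders and often use richer fusion (cross-attention rather than concatenation), but the pattern of embedding each modality and combining them into one prediction space is the same.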

Lesson 3: Exploring Diffusion Models

Lesson 3 introduces diffusion models, a groundbreaking approach to image generation. Unlike traditional methods, diffusion models iteratively refine noisy images to create coherent outputs. The lesson explores the theory behind both the forward (noise-adding) process and the backward (denoising) process. You also fine-tune your own diffusion model using a technique known as DreamBooth.
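The forward corruption process has a convenient closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of the per-step noise schedule. A minimal NumPy sketch, assuming a standard linear beta schedule (the specific schedule values and step count are illustrative defaults, not taken from the course):

```python
import numpy as np

rng = np.random.default_rng(42)

# Linear noise schedule: beta_t controls how much noise is added at step t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product: alpha-bar_t

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form, without looping over steps:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((8, 8))   # stand-in for a clean image
x_early = forward_diffuse(x0, 10)  # barely corrupted: still close to x0
x_late = forward_diffuse(x0, 999)  # nearly pure Gaussian noise
```

Training a denoiser then amounts to predicting `eps` from `x_t` and `t`; generation runs the learned backward process from pure noise down to t = 0.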

Lesson 4: Developing Multimodal AI Systems

Lesson 4 focuses on the practical aspects of designing and implementing multimodal AI applications. From fine-tuning text-to-speech models to building your own visual agent, the lesson demonstrates how to create cohesive systems that handle diverse input and output modalities.

Lesson 5: Evaluating and Testing Multimodal AI Systems

Lesson 5 covers evaluation metrics, benchmarks, and the ethical considerations involved in testing multimodal AI systems. It also discusses bias mitigation and responsible AI practices, covering topics such as LLMs as multimodal judges and the proliferation of deepfakes.
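The LLM-as-judge pattern typically works by prompting a strong model with a grading rubric and parsing a structured score from its reply. A hedged sketch of that pattern follows; the prompt wording, the `Score: N` reply format, and the helper names are all hypothetical, and no judge model is actually called here.

```python
import re

def build_judge_prompt(question, image_caption, model_answer):
    """Construct a rubric-style prompt asking an LLM to grade a VQA answer.
    (Illustrative only: the judge model itself is not invoked.)"""
    return (
        "You are grading a visual question answering system.\n"
        f"Image (described): {image_caption}\n"
        f"Question: {question}\n"
        f"Model answer: {model_answer}\n"
        "Rate the answer's correctness from 1 to 5 and reply as 'Score: N'."
    )

def parse_score(judge_reply):
    """Extract the 1-5 score from the judge's reply; None if it is missing."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

prompt = build_judge_prompt("How many dogs are in the photo?",
                            "Two dogs playing in a park.", "two")
score = parse_score("Score: 5 - the answer matches the image.")
```

Forcing a machine-parseable reply format and handling the no-score case are what make this pattern usable as an automated metric; bias in the judge model itself is one of the concerns the lesson raises.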

Lesson 6: Expanding and Applying Multimodal AI

Lesson 6 explores advanced techniques and future trends in multimodal AI. You will see how to extend existing AI systems with cutting-edge methods, integrating novel data types. The lesson also anticipates the direction of this rapidly evolving field and its future applications, including computer use for generalized, agentic AI behavior.

About Pearson Video Training

Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Sams, and Que. Topics include IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video training at http://www.informit.com/video.


Publisher Resources

ISBN: 9780135418536