Chapter 13. Multimodal Models

Chapter Goals

In this chapter you will:

Learn what is meant by a multimodal model.
Explore the inner workings of DALL.E 2, a large-scale text-to-image model from OpenAI.
Understand how CLIP and diffusion models such as GLIDE play an integral role in the overall DALL.E 2 architecture.
Analyze the limitations of DALL.E 2, as highlighted by the authors of the paper.
Explore the architecture of Imagen, a large-scale text-to-image model from Google Brain.
Learn about the latent diffusion process used by Stable Diffusion, an open source text-to-image model.
Understand the similarities and differences between DALL.E 2, Imagen, and Stable Diffusion.
Investigate DrawBench, a benchmarking suite for evaluating text-to-image models.
Learn the architectural design of Flamingo, a novel visual language model from DeepMind.
Unpick the different components of Flamingo and learn how they each contribute to the model as a whole.
Explore some of the capabilities of Flamingo, including conversational prompting.

So far, we have analyzed generative learning problems that focus on a single modality of data: text, images, or music. We have seen how GANs and diffusion models can generate state-of-the-art images and how Transformers are leading the way in both text and image generation. However, as humans, we have no difficulty crossing modalities: for example, writing a description of what is happening in a given photograph, creating ...

Generative Deep Learning, 2nd Edition by David Foster
