Chapter 13. Multimodal Models

So far, we have analyzed generative learning problems that focus solely on one modality of data: either text, images, or music. We have seen how GANs and diffusion models can generate state-of-the-art images and how Transformers are pioneering the way for both text and image generation. However, as humans, we have no difficulties crossing modalities—for example, writing a description of what is happening in a given photograph, creating ...

Get Generative Deep Learning, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.