Chapter 10. Any-to-Any Models
Multimodal models are powerful because they ground language in the world we actually live in. Vision-language models (VLMs) can look at images and answer questions about them in natural language. Video models can follow how actions unfold and how causes lead to effects over time. Audio models can hear speech and other sounds. All of them bring AI closer to human-like perception.
But what if you want to build an assistant that can see, hear, and speak? With the models we have seen so far, you would need a vision model, an audio model, a speech synthesis model, and careful orchestration among them, as the sketch below illustrates.
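To make that orchestration burden concrete, here is a minimal sketch of such a stitched-together assistant using Hugging Face `transformers` pipelines. The checkpoint names (`openai/whisper-small`, `dandelin/vilt-b32-finetuned-vqa`, `suno/bark-small`) are illustrative choices for this example, not recommendations from this chapter; any ASR, visual question answering, and text-to-speech models could fill the three roles.

```python
# A naive "see, hear, and speak" assistant stitched together from three
# separate models, one per modality, each with its own weights and quirks.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")
tts = pipeline("text-to-speech", model="suno/bark-small")

def assistant(audio_path: str, image_path: str) -> dict:
    """Hear a spoken question, look at an image, and answer out loud."""
    # 1. Hear: transcribe the spoken question to text.
    question = asr(audio_path)["text"]
    # 2. See: answer the question about the image (take the top answer).
    answer = vqa(image=image_path, question=question)[0]["answer"]
    # 3. Speak: synthesize the answer as audio.
    speech = tts(answer)  # {"audio": np.ndarray, "sampling_rate": int}
    return {"question": question, "answer": answer, "speech": speech}
```

Notice what this pipeline costs you: three sets of weights to load and serve, plain text as the only bridge between stages, and errors that compound from one model to the next, with no shared representation anywhere. That is exactly the burden a unified model removes.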
The next frontier is any-to-any models: unified architectures that handle understanding and generation across text, images, audio, and video within a single system (Figure 10-1).
Figure 10-1. Overview of a unified multimodal model supporting understanding and generation across text, images, video, and audio.
To be truly any-to-any, a model ...