CHAPTER 3
The Rise of Multimodal Learning Models
When you hear the term artificial intelligence (AI), what springs to mind? For many, it's probably a chatbot like ChatGPT—type in a question and get a humanlike response. But that's just the tip of the AI iceberg. To truly revolutionize our world, AI needs to evolve beyond text‐based interactions.
Enter multimodal learning, or what some call the Internet of Senses. This approach aims to make AI more humanlike by integrating multiple types of data (text, images, video, and sound) into a single, cohesive learning model. Humans have had lifelong training in reading these different signals; AI hasn't. By embracing a multimodal approach, we're not just improving AI's ability to understand us; we're opening doors to a wide array of practical applications that could transform how we interact with technology.
Why is multimodal learning crucial? Because human communication is complex. We don't just rely on words; we interpret facial expressions, tone of voice, and contextual cues. As any Star Trek fan knows, even logical Spock found humans puzzlingly “illogical” at times.
Multimodal learning brings AI closer to humanlike perception and cognition. By combining different data types, these models can generate richer, more nuanced insights by understanding not just our words, but our meaning. Similarly, recognizing an object might involve both its visual appearance and the sound it makes. This comprehensive data integration ...