Chapter 8. Video-Language Models
Deep learning has significantly advanced our ability to understand images. A video, by contrast, is a sequence of images over time that captures motion and unfolding events. Extending image models to video is challenging because treating each frame in isolation discards the connections between frames. For example, an image model might easily identify objects or people in each frame, but without understanding the sequence, it can't tell whether a person is entering a room or leaving it.
Figure 8-1. We provide some frames from a video and their timestamps to a video language model, and we get answers from it.
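The setup in Figure 8-1, sampling frames at known timestamps before handing them to the model, can be sketched in a few lines. The helper below is illustrative (its name and signature are not from the chapter): given a video's frame rate and length, it maps each desired timestamp to the nearest frame index, which you would then decode and pass to the video-language model along with the timestamps.

```python
# Map sampling timestamps (in seconds) to frame indices, assuming the
# video's frame rate (fps) and total frame count are known. Illustrative
# helper, not part of any specific video-language library.

def frame_indices_for_timestamps(timestamps_s, fps, num_frames):
    """Return the frame index nearest each timestamp, clamped to range."""
    indices = []
    for t in timestamps_s:
        idx = round(t * fps)
        indices.append(min(max(idx, 0), num_frames - 1))
    return indices

# A 10-second clip at 30 fps (300 frames), sampled every 2 seconds:
print(frame_indices_for_timestamps([0, 2, 4, 6, 8], fps=30, num_frames=300))
# -> [0, 60, 120, 180, 240]
```

Keeping the timestamps alongside the sampled frames is what lets the model reason about *when* things happen, not just *what* appears in each frame.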
As you will learn in this chapter, a simple frame-by-frame approach fails to distinguish actions that depend on the order of events, such as opening a door versus closing it. Additionally, because videos contain so much data spread across ...