Chapter 2. Vision Language Model Applications
Vision Language Models (VLMs) are models that can interpret both images and text. Typically trained on massive vision, vision-language, and language datasets, VLMs can accomplish a wide variety of vision tasks, such as identifying objects, actions, and scenes, while also understanding text, which enables multimodal tasks.
In this chapter, we take a look at multimodal tasks and models, how they are employed in practice, and how to evaluate them.
Image Captioning
Image captioning is the task of describing the visual content of an image in natural language; see Figure 2-1 for an example. The task sits at the intersection of computer vision and natural language processing: it requires a model to understand an image (identify objects, attributes, and their relationships) and generate a coherent sentence describing it. Image captioning is a fundamental problem in artificial intelligence because it connects vision and language.
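To make this concrete, here is a minimal sketch of image captioning with an off-the-shelf VLM, using the BLIP captioning model from the Hugging Face Transformers library. It assumes `transformers`, `torch`, and `Pillow` are installed; the model name, the function name `caption_image`, and the example file path are illustrative choices, not fixed by this chapter.

```python
def caption_image(image_path: str) -> str:
    """Generate a natural-language caption for the image at image_path.

    A sketch using the BLIP captioning model from Hugging Face
    Transformers; heavy imports live inside the function so the
    module can be imported without the dependencies installed.
    """
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    # Load the pretrained processor (image preprocessing + tokenizer)
    # and the captioning model. The first call downloads the weights.
    processor = BlipProcessor.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    )
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    )

    # Preprocess the image into model-ready tensors.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Autoregressively generate caption tokens, then decode to text.
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Hypothetical example image; replace with a real path.
    print(caption_image("photo.jpg"))
```

The generation step is the "language" half of the task: the vision encoder summarizes the image into embeddings, and the text decoder produces the sentence token by token, conditioned on those embeddings.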
Figure 2-1. Image captioning input, ...