December 2025
Beginner to intermediate
360 pages
10h 48m
English
In the previous chapter, we took our first steps in connecting vision and language by learning how models can align these two very different modalities. Now we’ll build on that foundation. While our ultimate goal is text-to-image generation, we’ll first master the reverse process, image-to-text captioning, because both directions rely on the same underlying principle of learning deep cross-modal relationships.
Training a model ...