Chapter 5. Transformers for Video Generation
This chapter will be a fun ride, because a lot of what you’ve learned so far now comes together in state-of-the-art (SOTA) text-to-video (T2V) and image-to-video (I2V) models. You’ve already seen how ViT restructures images as patches (“Embeddings and Tokenization for Vision Models”) and how DiT extends this into generative diffusion (“Scalable Diffusion Models with Transformers”). In fact, many T2V and I2V models build on DiT by adding a temporal dimension, stacking latent patches across time. And remember those rotary positional embeddings from “Longer Context Windows with Better Performance”? You’ll see them here again too.
Since you already understand how text-to-image (T2I) generation works, T2V and I2V become less of a leap and more of a natural generalization: you move from generating a single frame to generating a coherent sequence of frames. Most T2V and I2V models can also generate static images, and even 3D views, using the same core backbone. Most of the time, the architecture stays the same. Only the axis grows.
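To make “only the axis grows” concrete, here is a minimal sketch of spatio-temporal patchification in PyTorch. It is illustrative only: the class name, channel counts, and patch sizes are my assumptions rather than the code of any particular T2V model, but it shows how the 2D patch embedding you saw in ViT and DiT extends to a third, temporal axis.

```python
import torch
import torch.nn as nn

class SpatioTemporalPatchEmbed(nn.Module):
    """Illustrative patch embedding that adds a time axis to ViT/DiT-style patchify.

    Input:  latent video of shape (batch, channels, frames, height, width)
    Output: token sequence of shape (batch, num_patches, embed_dim)
    """

    def __init__(self, in_channels=4, embed_dim=768, patch_size=(1, 2, 2)):
        super().__init__()
        # A 3D convolution with stride == kernel size slices the latent into
        # non-overlapping (time, height, width) patches and projects each one
        # to a single token, exactly as Conv2d does for image patches.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                  # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D): one token per patch
        return x

# Toy VAE-style latents: 8 frames of a 32x32 latent grid with 4 channels.
# A single frame (frames=1) degenerates to ordinary image patchification,
# which is why the same backbone can also generate static images.
latents = torch.randn(2, 4, 8, 32, 32)
tokens = SpatioTemporalPatchEmbed()(latents)
print(tokens.shape)  # torch.Size([2, 2048, 768])
```

Everything downstream of this step, attention over tokens, conditioning, and denoising, can stay as it was for images; the transformer simply sees a longer sequence that now spans space and time.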
For me, that’s the elegance of transformers. Once you understand how they work, you start to see how everything fits together across image, video, audio, and more. That’s exactly why I chose to write this book about transformers beyond language: the deeper insight is not in treating each domain separately, as is often done with other deep learning models, but in realizing how naturally the same architecture extends across them.
Hence, in this chapter I’ll take ...