Chapter 7. Audiovisual Synchronization with Cross-Modal Fusion
Learning objective: In this chapter, you’ll learn how to generate synchronized audio and video using cross-modal fusion architectures. By the end of this chapter, you will have built a working audio-video generation system using the LTX-2 framework, studied the unified DiT architecture with bidirectional audio-video attention, worked through the audio VAE pipeline from mel-spectrograms to neural vocoders, and implemented two-stage flow matching inference, so that what you see and what you hear align in both time and meaning.
In the preceding chapters, we’ve generated silent videos and built systems that understand visual content. But authentic video experiences are inherently multimodal: a crackling fire, footsteps on gravel, the rhythm of speech. The video generation systems from Chapters 3 through 5 produce the modern equivalent of silent films: beautiful to watch but missing half the story. Early cinema audiences accepted ...