Video Generation with AI

Name: Video Generation with AI
Author: Joseph Enochs
ISBN: 9798341653344

July 2026

Intermediate to advanced

276 pages

7h 36m

English

O'Reilly Media, Inc.

1. Foundations of Video Generation
    Hands-On: Your First AI Video
    Environment Setup
    Loading the Model
    Generate Your First Video
    Save and Display Your Achievement
    Introduction to Generative Video Models
    What Makes Video Unique in AI
    Key Use Cases for Generative Video
    The Architectures Behind the Models
    GANs, Diffusion Models, and Variational Autoencoders
    Modern Generative Techniques for Video
    Open Models and Baseline Tools
    Evaluation and Early Experiments
    Getting Started with AI Video Generation
    Exercise 1: Implementing Basic Evaluation Functions
    Exercise 2: How Inference Steps Impact Quality
    Exercise 3: How Prompt Styles Influence Output
    Learning from Your Experiments
    Trade-off Between Inference Steps and Quality
    Subjectivity of Evaluation Metrics
    Recommended Settings by Use Case
    Evaluation Guide
    Hands-On Exercise: Create and Evaluate Your Own Video
    Summary
2. Understanding and Preparing Video Data for Model Training
    Hands-On: Working with Video Datasets
    Setting Up the Workspace
    Loading the Dataset
    Common Challenges in Handling Video Data
    Working with Video Datasets
    Types of Video Data
    Raw Video Data
    Hands-On: Analyzing Raw Video Properties
    Compressed Video Data
    Hands-On: Comparing Raw and Compressed Video
    Annotated Video Data
    Hands-On: Simulating Video Streaming
    Choosing Data Based on Use Case
    Sourcing and Collecting Video Data
    Automated Filtering Pipelines
    Tools for Video Handling
    Hands-On: Screening Quality with OpenCV
    Hands-On: Normalizing and Exporting with FFmpeg
    Hands-On: Detecting Scenes with PySceneDetect
    Video Processing Pipelines for AI
    Summary
3. Implementing Video Generation Pipelines
    Building the Foundation: Data Pipeline and Encoding
    Latent Space Compression with VAE
    Understanding 3D Convolutions for Video VAE
    Video Tokenization: From Pixels to Patches
    Patch Embedding and Positional Encoding
    Attention Mechanisms: Understanding Relationships in Space and Time
    What Is Attention?
    Spatial Attention: Coherence Within Frames
    Temporal Attention: Continuity Across Frames
    The Alternating Block Architecture
    The Diffusion Process: From Noise to Video
    Forward and Reverse Processes
    Noise Prediction and Scheduling
    Adaptive Conditioning with AdaLN
    Conditioning and Control: From Text to Video
    Text Encoding with T5
    Cross-Attention Integration
    Classifier-Free Guidance
    Latte Model Component Integration
    Timestep Embedding: Encoding the Diffusion Schedule
    Text Encoder: From Prompts to Embeddings
    Putting It All Together: The Generation Pipeline
    Summary
4. Training Video Generation Models
    Configuration-driven Training
    YAML Configuration Structure
    Loading and Managing Configurations
    Dataset Preparation and Loading
    Video Dataset Architecture
    Multiresolution Training Strategy
    Temporal Sampling and Frame Selection
    Training Infrastructure and Optimization
    Anatomy of a Training Loop
    Understanding Each Training Step
    Memory-Efficient Training Strategies
    Learning Rate Scheduling and Warmup
    Gradient Management and Stability
    Loss Functions and Metrics
    Loss Design: From Simple to Complex
    Scaling and Distributed Training
    Data Parallelism (Latte Approach)
    Advanced Parallelism
    Practical Considerations
    Summary
5. Fine-Tuning for Specific Video Tasks
    Fine-Tuning Methods
    LoRA: Low-Rank Adaptation for Video Models
    How LoRA Works in CogVideo
    Supervised Fine-Tuning (SFT) with DeepSpeed
    Choosing Between LoRA and SFT
    Decision Guide: When to Use Each Method
    Common Fine-Tuning Issues
    Issue 1: RuntimeError—Tensor Size Mismatch
    Issue 2: CUDA Out of Memory (OOM)
    Issue 3: Training Not Converging
    Issue 4: Mixed Precision Training Fails
    Issue 5: LoRA Weights Not Loading
    Domain Adaptation and Transfer Learning
    Few-Shot and Zero-Shot Video Generation
    Meta-Learning for Video Generation
    Compositional Video Generation
    Prompt Engineering and In-Context Learning
    Evaluation and Benchmarking Fine-Tuned Models
    Task-Specific Evaluation Metrics
    Ablation Studies and Analysis
    Transfer Learning Effectiveness
    Practical Execution: Running Your First Fine-Tuning
    Setting Up Your Environment
    Preparing Your Dataset
    LoRA Fine-Tuning: Quick Start
    SFT with DeepSpeed: Maximum Quality
    Multi-GPU Training: Scaling Up
    Inference with Fine-Tuned Models
    Production Examples
    Summary
6. Vision-Language Models for Video Understanding
    From Generation to Understanding
    The Two Sides of Video AI
    The Paradigm Shift
    Introducing Qwen3-VL
    VLM Architecture
    The Three-Component Model
    Vision Encoder: Seeing the Video
    The Projector: Bridging Two Worlds
    The LLM: Reasoning Over Multimodal Tokens
    Token Sequence Construction
    Putting It Together: A VideoQA Class
    3D Positional Encoding for Video
    Rotary Position Embedding (RoPE)
    Extending to 3D: Temporal, Height, Width
    Qwen3-VL’s Interleaved-MRoPE Implementation
    Timestamps Versus Frame Indices
    Extending VideoQA with Temporal Grounding
    Video Understanding Pipeline
    The Complete Processing Pipeline
    Video Understanding Task Types
    Inference Code Example
    Handling Long Videos
    Adapting to Your Domain: Dataset Preparation
    Training with LoRA
    Summary
7. Audiovisual Synchronization with Cross-Modal Fusion
    The Missing Modality: Audio
    Why Post Hoc Audio Fails
    Audio Representation: From Waveforms to Mel-Spectrograms
    Introducing LTX-2
    Unified DiT Architecture
    The LTXModel Design
    The Modality Dataclass
    Bidirectional Cross-Modal Attention
    AdaLN: Adaptive Layer Normalization
    Audio Pipeline Deep Dive
    Mel-Spectrogram Representation
    Audio VAE Encoder
    Per-Channel Statistics Normalization
    Audio VAE Decoder
    HiFi-GAN Style Vocoder
    AudioPatchifier
    Two-Stage Inference Pipeline
    Pipeline Initialization: ModelLedger
    Stage 1: CFG-Guided Generation
    Spatial Upsampling
    Stage 2: Distilled Refinement
    Final Decoding
    Hands-On Implementation
    Environment Setup
    Hardware Requirements
    Required Model Downloads
    Text-to-Video Generation
    Command-Line Interface
    Image-to-Video Generation
    Speech Synthesis
    Fine-Tuning with LoRA
    Using Trained LoRAs for Inference
    Architectural Evolution: From Research to Production
    Summary
8. From Models to Systems
    The Gap Between a Model and a Product
    What “Production AI System” Means in 2026
    Three Reference Architectures
    LTX Desktop: Eight Patterns
    Pattern 1: The Two-Process Architecture
    Pattern 2: Backend as Source of Truth
    Pattern 3: The Typed IPC Bridge
    Pattern 4: Localhost as a Security Boundary
    Pattern 5: Model Lifecycle: Download, License, Storage
    Pattern 6: State as Discriminated Unions
    Pattern 7: Service Protocols and Fakes (No Mocks)
    Pattern 8: Lock-Aware Handlers
    The Landscape: Three Architectures, One Problem
    ComfyUI: Community Plug-in Mesh
    InvokeAI: Service-Oriented Monolith
    Capstone: Building a Custom Extension Across Three Platforms
    The Feature: Add LTX-2 Audio-to-Video as a New Capability
    Implementation in LTX Desktop
    Implementation in ComfyUI
    Implementation in InvokeAI
    What the Three Implementations Reveal
    Summary and Conclusion: The Video AI Landscape

Content preview from Video Generation with AI

Chapter 6. Vision-Language Models for Video Understanding

Learning objective: In this chapter, you will learn the architecture and implementation of vision-language models (VLMs) for video understanding, including the three-component design (vision encoder, projector, language model), 3D positional encoding for spatiotemporal reasoning, and building practical video question-answering systems with fine-tuning capabilities.

Building on Chapter 5’s fine-tuning techniques, we now tackle a complementary challenge: given a video, how can a model understand and reason about its contents? VLMs provide the answer by turning video understanding into a language modeling problem. As discussed in Chapter 2, full semantic understanding is an additional benefit of using VLMs.
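
The three-component design named in the learning objective (vision encoder, projector, language model) can be pictured as a data flow that ends in a single token sequence. The following is a minimal sketch under stated assumptions, not the book's code and not how Qwen3-VL is implemented: the names `make_linear` and `build_sequence`, the tiny dimensions, and the random weights are all hypothetical, and the language model itself is left out. It only shows how per-frame patches might be embedded, projected into the language model's embedding width, and placed in front of the text tokens.

```python
import random

# Illustrative sizes only; real models use far larger dimensions.
PATCH_LEN, VISION_DIM, LLM_DIM = 4, 8, 16


def make_linear(in_dim, out_dim, rng):
    """Return a function applying a fixed random linear map (toy weights)."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]

    return apply


def build_sequence(frames, text_ids, seed=0):
    """Turn frames and text token ids into one list of LLM-width vectors.

    frames:   list of frames, each a flat list of pixel values
    text_ids: list of integer token ids (hypothetical tokenizer output)
    """
    rng = random.Random(seed)
    patch_embed = make_linear(PATCH_LEN, VISION_DIM, rng)  # vision encoder stand-in
    projector = make_linear(VISION_DIM, LLM_DIM, rng)      # vision -> LLM space
    text_table = {}                                        # toy embedding lookup

    visual_tokens = []
    for frame in frames:
        # Cut each frame into fixed-length patches, one visual token per patch.
        for start in range(0, len(frame), PATCH_LEN):
            patch = frame[start:start + PATCH_LEN]
            visual_tokens.append(projector(patch_embed(patch)))

    text_tokens = []
    for token_id in text_ids:
        if token_id not in text_table:
            text_table[token_id] = [rng.gauss(0.0, 0.02) for _ in range(LLM_DIM)]
        text_tokens.append(text_table[token_id])

    # A language model would consume this combined sequence.
    return visual_tokens + text_tokens


if __name__ == "__main__":
    frames = [[0.1] * 8, [0.2] * 8]   # two tiny "frames" of 8 pixels each
    sequence = build_sequence(frames, text_ids=[5, 7, 5])
    print(len(sequence), len(sequence[0]))  # 7 16
```

Two frames of eight pixels yield four visual tokens, which sit ahead of the three text tokens. Because every token has the same width after projection, the language model can treat the video as part of its ordinary input sequence.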

From Generation to Understanding

If video generation is like painting from a description, video understanding is like art criticism that analyzes what’s already there. The previous chapters taught you to create video from ...

Publisher Resources

ISBN: 9798341653337
