Vision Language Models

by Merve Noyan, Andrés Marafioti, Miquel Farré, Orr Zohar

June 2026

Intermediate to advanced

408 pages

10h 3m

English

O'Reilly Media, Inc.

Foreword
Preface
    Who Is This Book For?
    What You Will Learn
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
1. Introduction to Vision and Language
    Brief Introduction to Computer Vision
    Signal Decomposition Techniques
    Filters and Feature Extraction Kernels
    From Filters to Convolutional Neural Networks
    Other Basic Convolutional Neural Network–Based Pipelines
    Transfer Learning
    Computer Vision Backbones
    Transformers and Their Origins in Language
    Vision Transformers
    Modern Vision Language Models
    Brief Introduction to Hugging Face Open Source Ecosystem
    Hugging Face Hub
    Libraries
    Coding Example: Searching for Images from Text
    Summary
2. Vision Language Model Applications
    Image Captioning
    Visual Question Answering
    Visual Reasoning
    Visual Language Retrieval
    Document Understanding
    Document Visual Question Answering
    Video Understanding
    Instance Localization with Vision Language Models
    Zero-Shot Object Detection
    Object Counting
    Image Segmentation
    Summary
3. Vision Language Model Training
    A Bird’s-Eye View of Training Vision Language Models
    The How: What Different Training Paradigms Do
    The When: A Model’s Training Stages
    Training Vision Language Models
    First Things First: Training Data
    Let’s Architect Our Vision Language Model
    Loading More Than One Sample at a Time
    Training with a Real Batch Size
    Inferring with Our Trained Model
    Making Inference Faster with Key-Value Cache
    Dealing with High-Resolution Images
    Summary
4. Training Data and Preprocessing for VLMs
    Looking at the Data
    Image-Text Datasets
    Video-Text Datasets
    Vision-Language-Action Datasets
    Building a Dataset
    Data Sourcing at Scale
    Data Filtering at Scale
    Sample Diversity at Scale
    Data Annotation and Quality Validation at Scale
    Preparing the Dataset for Consumption
    Dataset Mixtures: The Hidden Hyperparameter
    Mixture Ingredients and Proportions
    Task-Driven Mixture Design
    Ablations and Evaluation
    Summary
5. Post-Training Vision Language Models
    Supervised Fine-Tuning
    Parameter Efficient Fine-Tuning
    Training with Quantization
    Introduction to Transformers Reinforcement Learning
    Reinforcement Learning from Human Feedback
    Direct Preference Optimization and Mixed Preference Optimization
    Group Relative Policy Optimization
    Summary
6. Core Architectures of Vision Language Models
    The Key to Combining Information: Multimodal Attention
    Self-Attention: Finding Relationships Within a Sequence
    Cross-Attention: Bridging Two Different Streams
    Modern VLM Blueprints: Connecting Pretrained Models
    The Adapter Approach: Cross-Attention
    The Unified Sequence Approach: Self-Attention
    Comparing Architectures: Which Way to Go?
    Foundational Concepts in VLM Design
    The Fusion Framework: Early or Late?
    The Encoder-Decoder Pattern: Where It All Started
    Summary
7. Deploying Models for Inference at Scale
    Inference Optimization for VLMs
    Understanding the KV Cache
    Attention Optimizations: FlashAttention and Beyond
    Understanding GPU Memory: The Foundation of All Inference Optimization
    The Attention Bottleneck and FlashAttention to the Rescue
    Using Optimized Attention
    Quantization for VLMs
    Why Quantization Helps: Bandwidth, Not Compute
    Weight-Only Quantization
    The Outlier Problem
    Quantization Methods
    The VLM Quantization Asymmetry
    Practical Notes
    torchao
    Exporting Models to Different Runtimes
    ONNX
    TensorRT: Maximum GPU Performance
    Browser Deployment with transformers.js
    Packaging and Deploying in Real Environments
    It Runs
    Efficient Deployment with vLLM
    Production Optimizations
    On-Device/Edge Deployment
    The Edge Landscape
    MLX on Apple Silicon
    Llama.cpp
    Mobile Deployment
    PEFT Adapters for Edge Customization
    Hybrid Patterns
    Summary
8. Document AI
    Introduction to Document AI
    Information Extraction
    Document Parsing
    Picking the “Right” Model
    Multimodal Document Retrieval
    Approaches in Solving Document AI Problems
    Early Document AI Models for Information Extraction and Document Classification
    Code Examples: Document AI with Modern Vision Language Models
    Document RAG
    Summary

9. Video-Language Models
    Foundations
    Video-Language Tasks
    From Images to Video: Core Concepts
    The Evolution to Video-Language Models
    Temporal Modeling
    Core Challenges in Temporal Modeling
    Attention Mechanisms for Video
    Video-Language Models in Practice
    Picking the Right Output Mode
    Retrieval Pipelines That Scale
    Video-RAG: Retrieval-Augmented Video QA
    Fine-Tuning a Video-Language Model for Your Domain
    Efficiency in Video-Language Models
    Token Efficiency
    Training Efficiency
    Summary
10. Any-to-Any Models
    Introducing the Three Approaches
    Unified Vocabulary Models
    Hybrid Models
    Modular Models
    Unified Vocabulary Models
    Monolithic Architecture
    Factorized Heads
    Hybrid Multiobjective Models
    Variational Autoencoders
    Connecting Continuous Latents to Language Models Through Diffusion
    Putting It All Together: From Prompt to Generated Output
    Late-Conditioning Models
    Conditioning Interfaces: How to Represent Intent
    Connectors: Bridging the Gap
    How Generators Use Conditioning
    Hands On: Qwen-Image-Edit, a Modular Diffusion Editor
    Training
    Task Balance and Data Mixing
    Staged Training: Divide and Conquer
    Architecture Specific Loss Functions
    Getting Your Hands Dirty
    Summary
11. Advanced Topics and Cutting-Edge Research
    Agentic Vision Language Models
    Introduction to Agents
    Introduction to Smolagents
    Computer Use Agents
    Vision-Language-Action Models
    Closing the Perception-Action Loop
    From VLM to VLA
    Model Landscape Overview
    Summary
Index
About the Authors

Content preview from Vision Language Models

Chapter 9. Video-Language Models

The when and how of events unfolding over time is what separates interpreting a photo from understanding a story: the world doesn’t stand still. What is the deer in Figure 9-1 doing in the street? Looking across multiple frames, we see that it is in a park and that people are not scared of it but enjoy observing and interacting with it.

Diagram showing frames from a video analyzed by a video language model, illustrating a deer walking in a park and interacting with people.

Videos contain everything images do, plus time. The temporal dimension is deceptively expensive: a 10-second clip at 30 frames per second contains 300 frames, attention costs grow quadratically with sequence length, and motion patterns require understanding not just what objects are but how they move. A strong image classifier might nail every frame individually yet completely miss the action that ties them together.
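
To make that arithmetic concrete, here is a back-of-the-envelope sketch. The 30 frames-per-second rate follows from 300 frames in 10 seconds. The 196 tokens per frame (a 224×224 frame cut into 16×16 patches) and the helper name `attention_cost` are assumptions chosen for illustration, not values taken from the chapter.

```python
def attention_cost(duration_s, fps, tokens_per_frame):
    """Return (frames, sequence length, query-key pairs) for full self-attention.

    tokens_per_frame is an illustrative assumption; real models vary widely.
    """
    frames = int(duration_s * fps)
    seq_len = frames * tokens_per_frame
    # Full self-attention compares every token with every token.
    pairs = seq_len ** 2
    return frames, seq_len, pairs


frames, seq_len, pairs = attention_cost(duration_s=10, fps=30, tokens_per_frame=196)
single_frame_pairs = 196 ** 2

print(frames)                       # 300
print(seq_len)                      # 58800
print(pairs)                        # 3457440000
print(pairs // single_frame_pairs)  # 90000
```

Under these assumptions, attending over all 300 frames at once costs 90,000 times as many query-key comparisons as a single frame, not 300 times, which is why the frame count alone understates the cost.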

In this chapter, you will learn how video-language models tackle these challenges. You will see how temporal modeling mechanisms capture motion and event sequences, how attention strategies scale to handle hundreds of frames while fitting in a GPU, and how to combine embedding and generative models into practical pipelines that work at scale.

By the end of this chapter, you will:

Understand how video models evolved from 3D convolutional neural networks (CNNs) to modern transformer architectures and why factorization ...

ISBN: 9798341624030

Figure 9-1. We provide frames from a video and their timestamps to a video-language model, and we get answers from it.
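
One way to prepare the input described in this caption is to pick evenly spaced frames and pair each with its timestamp. The sketch below is a minimal illustration using only plain Python. The function names, the choice of sampling the middle of each segment, and the label format are all assumptions, not the authors' pipeline or any particular model's expected prompt format.

```python
def sample_frames(duration_s, fps, num_frames):
    """Pick evenly spaced frame indices and their timestamps in seconds.

    Hypothetical helper: samples the middle frame of each equal-length segment.
    """
    total = int(duration_s * fps)
    if total <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total)  # cannot sample more frames than exist
    indices = [int((i + 0.5) * total / num_frames) for i in range(num_frames)]
    return [(idx, idx / fps) for idx in indices]


def timestamp_labels(samples):
    """Format one text label per sampled frame (illustrative format only)."""
    return [f"Frame at {t:.2f}s" for _, t in samples]


samples = sample_frames(duration_s=10, fps=30, num_frames=4)
print([idx for idx, _ in samples])  # [37, 112, 187, 262]
print(timestamp_labels(samples))
```

Decoding the actual pixels for each index would require a video library, which is left out here to keep the sketch self-contained.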
