Vision Language Models

by Merve Noyan, Andrés Marafioti, Miquel Farré, Orr Zohar

June 2026

Intermediate to advanced

408 pages

10h 3m

English

O'Reilly Media, Inc.

Foreword
Preface
    Who Is This Book For?
    What You Will Learn
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
1. Introduction to Vision and Language
    Brief Introduction to Computer Vision
    Signal Decomposition Techniques
    Filters and Feature Extraction Kernels
    From Filters to Convolutional Neural Networks
    Other Basic Convolutional Neural Network–Based Pipelines
    Transfer Learning
    Computer Vision Backbones
    Transformers and Their Origins in Language
    Vision Transformers
    Modern Vision Language Models
    Brief Introduction to Hugging Face Open Source Ecosystem
    Hugging Face Hub
    Libraries
    Coding Example: Searching for Images from Text
    Summary
2. Vision Language Model Applications
    Image Captioning
    Visual Question Answering
    Visual Reasoning
    Visual Language Retrieval
    Document Understanding
    Document Visual Question Answering
    Video Understanding
    Instance Localization with Vision Language Models
    Zero-Shot Object Detection
    Object Counting
    Image Segmentation
    Summary
3. Vision Language Model Training
    A Bird’s-Eye View of Training Vision Language Models
    The How: What Different Training Paradigms Do
    The When: A Model’s Training Stages
    Training Vision Language Models
    First Things First: Training Data
    Let’s Architect Our Vision Language Model
    Loading More Than One Sample at a Time
    Training with a Real Batch Size
    Inferring with Our Trained Model
    Making Inference Faster with Key-Value Cache
    Dealing with High-Resolution Images
    Summary
4. Training Data and Preprocessing for VLMs
    Looking at the Data
    Image-Text Datasets
    Video-Text Datasets
    Vision-Language-Action Datasets
    Building a Dataset
    Data Sourcing at Scale
    Data Filtering at Scale
    Sample Diversity at Scale
    Data Annotation and Quality Validation at Scale
    Preparing the Dataset for Consumption
    Dataset Mixtures: The Hidden Hyperparameter
    Mixture Ingredients and Proportions
    Task-Driven Mixture Design
    Ablations and Evaluation
    Summary
5. Post-Training Vision Language Models
    Supervised Fine-Tuning
    Parameter Efficient Fine-Tuning
    Training with Quantization
    Introduction to Transformers Reinforcement Learning
    Reinforcement Learning from Human Feedback
    Direct Preference Optimization and Mixed Preference Optimization
    Group Relative Policy Optimization
    Summary
6. Core Architectures of Vision Language Models
    The Key to Combining Information: Multimodal Attention
    Self-Attention: Finding Relationships Within a Sequence
    Cross-Attention: Bridging Two Different Streams
    Modern VLM Blueprints: Connecting Pretrained Models
    The Adapter Approach: Cross-Attention
    The Unified Sequence Approach: Self-Attention
    Comparing Architectures: Which Way to Go?
    Foundational Concepts in VLM Design
    The Fusion Framework: Early or Late?
    The Encoder-Decoder Pattern: Where It All Started
    Summary
7. Deploying Models for Inference at Scale
    Inference Optimization for VLMs
    Understanding the KV Cache
    Attention Optimizations: FlashAttention and Beyond
    Understanding GPU Memory: The Foundation of All Inference Optimization
    The Attention Bottleneck and FlashAttention to the Rescue
    Using Optimized Attention
    Quantization for VLMs
    Why Quantization Helps: Bandwidth, Not Compute
    Weight-Only Quantization
    The Outlier Problem
    Quantization Methods
    The VLM Quantization Asymmetry
    Practical Notes
    torchao
    Exporting Models to Different Runtimes
    ONNX
    TensorRT: Maximum GPU Performance
    Browser Deployment with transformers.js
    Packaging and Deploying in Real Environments
    It Runs
    Efficient Deployment with vLLM
    Production Optimizations
    On-Device/Edge Deployment
    The Edge Landscape
    MLX on Apple Silicon
    Llama.cpp
    Mobile Deployment
    PEFT Adapters for Edge Customization
    Hybrid Patterns
    Summary
8. Document AI
    Introduction to Document AI
    Information Extraction
    Document Parsing
    Picking the “Right” Model
    Multimodal Document Retrieval
    Approaches in Solving Document AI Problems
    Early Document AI Models for Information Extraction and Document Classification
    Code Examples: Document AI with Modern Vision Language Models
    Document RAG
    Summary

9. Video-Language Models
    Foundations
    Video-Language Tasks
    From Images to Video: Core Concepts
    The Evolution to Video-Language Models
    Temporal Modeling
    Core Challenges in Temporal Modeling
    Attention Mechanisms for Video
    Video-Language Models in Practice
    Picking the Right Output Mode
    Retrieval Pipelines That Scale
    Video-RAG: Retrieval-Augmented Video QA
    Fine-Tuning a Video-Language Model for Your Domain
    Efficiency in Video-Language Models
    Token Efficiency
    Training Efficiency
    Summary
10. Any-to-Any Models
    Introducing the Three Approaches
    Unified Vocabulary Models
    Hybrid Models
    Modular Models
    Unified Vocabulary Models
    Monolithic Architecture
    Factorized Heads
    Hybrid Multiobjective Models
    Variational Autoencoders
    Connecting Continuous Latents to Language Models Through Diffusion
    Putting It All Together: From Prompt to Generated Output
    Late-Conditioning Models
    Conditioning Interfaces: How to Represent Intent
    Connectors: Bridging the Gap
    How Generators Use Conditioning
    Hands On: Qwen-Image-Edit, a Modular Diffusion Editor
    Training
    Task Balance and Data Mixing
    Staged Training: Divide and Conquer
    Architecture Specific Loss Functions
    Getting Your Hands Dirty
    Summary
11. Advanced Topics and Cutting-Edge Research
    Agentic Vision Language Models
    Introduction to Agents
    Introduction to Smolagents
    Computer Use Agents
    Vision-Language-Action Models
    Closing the Perception-Action Loop
    From VLM to VLA
    Model Landscape Overview
    Summary
Index
About the Authors

Content preview from Vision Language Models

Chapter 5. Post-Training Vision Language Models

Until now, we have been living the full “from scratch” fantasy. We took a tiny vision-language model (VLM), wired images into text, fought with padding and packing, watched the loss wobble its way down, and even made it answer a few questions about bears and mountains. That is roughly what pretraining and basic supervised training look like in miniature.

In practice, though, most people do not spin up a VLM from nothing. You usually start from a strong base model that already knows a lot about language and images, and then you nudge it into the behavior you actually want. That second phase is what people call post-training. It has three big ingredients:

Supervised fine-tuning (SFT): This is where you show the model lots of “here is a prompt, here is a good answer” pairs so it can follow instructions and handle the tasks you care about.
Alignment: This is where you teach the model what humans actually prefer, using reinforcement learning from human feedback (RLHF) and related preference-based methods (direct preference optimization, mixed preference optimization, and group relative policy optimization), so that the answers are not just correct but also helpful and safe.
Reinforcement learning with verifiable rewards (RLVR): This replaces human preferences with verifiable reward functions (for example, checking a math answer or a code result) to improve performance on tasks whose outputs can be checked automatically.
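
One way to see the difference between these three ingredients is to look at the training signal each one consumes. The sketch below is not taken from the chapter: the class names, field names, and `exact_answer_reward` are hypothetical and chosen for illustration only. It shows an SFT record, a preference record scored with the standard DPO loss for a single pair, and a simple rule-based reward of the kind RLVR relies on. The log-probabilities passed to `dpo_loss` are assumed to be summed sequence log-probabilities that you computed elsewhere with your policy and reference models.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class SFTExample:
    """SFT signal: a prompt plus one good answer to imitate (hypothetical fields)."""
    image_path: str
    prompt: str
    answer: str


@dataclass
class PreferencePair:
    """Alignment signal: two answers to the same prompt, one preferred over the other."""
    image_path: str
    prompt: str
    chosen: str
    rejected: str


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen answer
    over the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x)
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def exact_answer_reward(completion: str, expected: str) -> float:
    """RLVR signal: 1.0 if the last number in the completion matches, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0


if __name__ == "__main__":
    pair = PreferencePair("bears.jpg", "How many bears are there?",
                          chosen="There are 3 bears.", rejected="Many.")
    # Made-up log-probabilities, only to exercise the function.
    print(dpo_loss(-4.0, -9.0, -5.0, -5.0))
    print(exact_answer_reward(pair.chosen, "3"))
```

Note that the SFT and preference records need a human (or a stronger model) to write or rank answers, while the reward function needs only a ground-truth value to compare against. That is why RLVR fits tasks such as counting or arithmetic, where correctness can be checked by a program.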

In this chapter, we will look at all three. We will start with parameter-efficient ...

Publisher Resources

ISBN: 9798341624030
