Overview
Vision language models (VLMs) combine computer vision and natural language processing to create powerful systems that can interpret, generate, and respond in multimodal contexts. Vision Language Models is a hands-on guide to building real-world VLMs using the most up-to-date stack of machine learning tools from Hugging Face, Meta (PyTorch), NVIDIA (CUDA), OpenAI (CLIP), and others, written by leading researchers and practitioners Merve Noyan, Miquel Farré, Andrés Marafioti, and Orr Zohar. From image captioning and document understanding to advanced zero-shot inference and retrieval-augmented generation (RAG), this book covers the full VLM development and application lifecycle.
Designed for ML engineers, data scientists, and developers, this guide distills cutting-edge VLM research into practical techniques. Readers will learn how to prepare datasets, select the right architectures, fine-tune and deploy models, and apply them to real-world tasks across a range of industries.
- Explore core model architectures and alignment techniques
- Train and fine-tune VLMs with Hugging Face, PyTorch, and others
- Deploy models for applications like image search and captioning
- Implement advanced inference strategies, from zero-shot to agentic systems
- Build scalable VLM systems ready for production use
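To give a flavor of the zero-shot inference covered above, here is a minimal sketch of CLIP-style zero-shot classification: an image embedding is compared against a set of text-label embeddings by cosine similarity, and a softmax turns the similarities into label probabilities. The embeddings below are random stand-ins, not real CLIP outputs, and the function name and temperature value are illustrative assumptions rather than the book's code.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """CLIP-style zero-shot classification sketch: cosine similarity
    between one image embedding and several label embeddings,
    converted to probabilities with a softmax."""
    # L2-normalize so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Scale similarities by a temperature before the softmax
    logits = txt @ img / temperature
    exp = np.exp(logits - logits.max())  # subtract max for stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
image = rng.normal(size=512)            # stand-in image embedding
labels = rng.normal(size=(3, 512))      # stand-ins for e.g. "a cat", "a dog", "a car"
probs = zero_shot_scores(image, labels)
print(probs)
```

In a real pipeline, the image and text embeddings would come from a pretrained model's image and text encoders; the classification step itself stays this simple.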