Vision Language Models

by Merve Noyan, Andrés Marafioti, Miquel Farré, Orr Zohar

June 2026

Intermediate to advanced

408 pages

10h 3m

English

O'Reilly Media, Inc.

Foreword
Preface
Who Is This Book For?; What You Will Learn; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction to Vision and Language
Brief Introduction to Computer Vision; Signal Decomposition Techniques; Filters and Feature Extraction Kernels; From Filters to Convolutional Neural Networks; Other Basic Convolutional Neural Network–Based Pipelines; Transfer Learning; Computer Vision Backbones; Transformers and Their Origins in Language; Vision Transformers; Modern Vision Language Models; Brief Introduction to Hugging Face Open Source Ecosystem; Hugging Face Hub; Libraries; Coding Example: Searching for Images from Text; Summary
2. Vision Language Model Applications
Image Captioning; Visual Question Answering; Visual Reasoning; Visual Language Retrieval; Document Understanding; Document Visual Question Answering; Video Understanding; Instance Localization with Vision Language Models; Zero-Shot Object Detection; Object Counting; Image Segmentation; Summary
3. Vision Language Model Training
A Bird’s-Eye View of Training Vision Language Models; The How: What Different Training Paradigms Do; The When: A Model’s Training Stages; Training Vision Language Models; First Things First: Training Data; Let’s Architect Our Vision Language Model; Loading More Than One Sample at a Time; Training with a Real Batch Size; Inferring with Our Trained Model; Making Inference Faster with Key-Value Cache; Dealing with High-Resolution Images; Summary
4. Training Data and Preprocessing for VLMs
Looking at the Data; Image-Text Datasets; Video-Text Datasets; Vision-Language-Action Datasets; Building a Dataset; Data Sourcing at Scale; Data Filtering at Scale; Sample Diversity at Scale; Data Annotation and Quality Validation at Scale; Preparing the Dataset for Consumption; Dataset Mixtures: The Hidden Hyperparameter; Mixture Ingredients and Proportions; Task-Driven Mixture Design; Ablations and Evaluation; Summary
5. Post-Training Vision Language Models
Supervised Fine-Tuning; Parameter Efficient Fine-Tuning; Training with Quantization; Introduction to Transformers Reinforcement Learning; Reinforcement Learning from Human Feedback; Direct Preference Optimization and Mixed Preference Optimization; Group Relative Policy Optimization; Summary
6. Core Architectures of Vision Language Models
The Key to Combining Information: Multimodal Attention; Self-Attention: Finding Relationships Within a Sequence; Cross-Attention: Bridging Two Different Streams; Modern VLM Blueprints: Connecting Pretrained Models; The Adapter Approach: Cross-Attention; The Unified Sequence Approach: Self-Attention; Comparing Architectures: Which Way to Go?; Foundational Concepts in VLM Design; The Fusion Framework: Early or Late?; The Encoder-Decoder Pattern: Where It All Started; Summary
7. Deploying Models for Inference at Scale
Inference Optimization for VLMs; Understanding the KV Cache; Attention Optimizations: FlashAttention and Beyond; Understanding GPU Memory: The Foundation of All Inference Optimization; The Attention Bottleneck and FlashAttention to the Rescue; Using Optimized Attention; Quantization for VLMs; Why Quantization Helps: Bandwidth, Not Compute; Weight-Only Quantization; The Outlier Problem; Quantization Methods; The VLM Quantization Asymmetry; Practical Notes; torchao; Exporting Models to Different Runtimes; ONNX; TensorRT: Maximum GPU Performance; Browser Deployment with transformers.js; Packaging and Deploying in Real Environments; It Runs; Efficient Deployment with vLLM; Production Optimizations; On-Device/Edge Deployment; The Edge Landscape; MLX on Apple Silicon; Llama.cpp; Mobile Deployment; PEFT Adapters for Edge Customization; Hybrid Patterns; Summary
8. Document AI
Introduction to Document AI; Information Extraction; Document Parsing; Picking the “Right” Model; Multimodal Document Retrieval; Approaches in Solving Document AI Problems; Early Document AI Models for Information Extraction and Document Classification; Code Examples: Document AI with Modern Vision Language Models; Document RAG; Summary

9. Video-Language Models
Foundations; Video-Language Tasks; From Images to Video: Core Concepts; The Evolution to Video-Language Models; Temporal Modeling; Core Challenges in Temporal Modeling; Attention Mechanisms for Video; Video-Language Models in Practice; Picking the Right Output Mode; Retrieval Pipelines That Scale; Video-RAG: Retrieval-Augmented Video QA; Fine-Tuning a Video-Language Model for Your Domain; Efficiency in Video-Language Models; Token Efficiency; Training Efficiency; Summary
10. Any-to-Any Models
Introducing the Three Approaches; Unified Vocabulary Models; Hybrid Models; Modular Models; Unified Vocabulary Models; Monolithic Architecture; Factorized Heads; Hybrid Multiobjective Models; Variational Autoencoders; Connecting Continuous Latents to Language Models Through Diffusion; Putting It All Together: From Prompt to Generated Output; Late-Conditioning Models; Conditioning Interfaces: How to Represent Intent; Connectors: Bridging the Gap; How Generators Use Conditioning; Hands On: Qwen-Image-Edit, a Modular Diffusion Editor; Training; Task Balance and Data Mixing; Staged Training: Divide and Conquer; Architecture Specific Loss Functions; Getting Your Hands Dirty; Summary
11. Advanced Topics and Cutting-Edge Research
Agentic Vision Language Models; Introduction to Agents; Introduction to Smolagents; Computer Use Agents; Vision-Language-Action Models; Closing the Perception-Action Loop; From VLM to VLA; Model Landscape Overview; Summary
Index
About the Authors

Content preview from Vision Language Models

Preface

Today, you can take your phone out in a museum, snap a picture of a painting, and ask a model about the influences the artist drew on and what the piece might be trying to convey. The same model can watch the videos on your phone and give you quick summaries to help you find them later. Vision language models (VLMs) make all of this possible by connecting visual perception and language. They have moved quickly from research prototypes to real products that people use every day.

But building new things with these models is harder than the user experience suggests. The field moves fast, new articles come out daily, and practical guidance is scattered across blog posts, library docs, and informal knowledge passed around at networking events. If you want to use, train, or fine-tune a VLM, it is not obvious how to choose the right architecture, how to curate your datasets, or how to deploy efficiently. You end up piecing the knowledge together yourself.

This book is our attempt to change that. It is the book we wished we had when multimodal work stopped being a research curiosity and became an engineering problem.

We wrote it as a team that has spent years building, documenting, and shipping open source multimodal systems at Hugging Face. Between us we have trained and released VLMs like SmolVLM, integrated dozens of multimodal models into the open source ecosystem, built tooling and demos that make these models accessible to practitioners, and written extensively about the ...

Publisher Resources

ISBN: 9798341624030
