Chapter 5. Transformers for Video Generation
This chapter will be a fun ride, because a lot of what you’ve learned so far now comes together in state-of-the-art (SOTA) text-to-video (T2V) and image-to-video (I2V) models. You’ve already seen how ViT restructures images as patches (“Embeddings and Tokenization for Vision Models”) and how DiT extends this into generative diffusion (“Scalable Diffusion Models with Transformers”). In fact, many T2V and I2V models build on DiT by adding a temporal dimension, stacking latent patches across time. And remember those rotary positional embeddings from “Longer Context Windows with Better Performance”? You’ll see them here again too.
Since you already understand how text-to-image (T2I) generation works, T2V and I2V become less of a leap and more of a natural generalization: you move from generating a single frame to generating a coherent sequence of frames. Most T2V and I2V models can also generate static images, and even 3D views, using the same core backbone. Most of the time, the architecture stays the same. Only the axis grows.
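To make “only the axis grows” concrete, here is a minimal sketch of spatio-temporal patchification in PyTorch. It is illustrative only: the class name, channel counts, and patch sizes are my assumptions rather than the code of any particular T2V model, but it shows how the 2D patch embedding you saw in ViT and DiT extends to a third, temporal axis.

```python
import torch
import torch.nn as nn

class SpatioTemporalPatchEmbed(nn.Module):
    """Illustrative patch embedding that adds a time axis to ViT/DiT-style patchify.

    Input:  latent video of shape (batch, channels, frames, height, width)
    Output: token sequence of shape (batch, num_patches, embed_dim)
    """

    def __init__(self, in_channels=4, embed_dim=768, patch_size=(1, 2, 2)):
        super().__init__()
        # A 3D convolution with stride == kernel size slices the latent into
        # non-overlapping (time, height, width) patches and projects each one
        # to a single token, exactly as Conv2d does for image patches.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                  # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D): one token per patch
        return x

# Toy VAE-style latents: 8 frames of a 32x32 latent grid with 4 channels.
# A single frame (frames=1) degenerates to ordinary image patchification,
# which is why the same backbone can also generate static images.
latents = torch.randn(2, 4, 8, 32, 32)
tokens = SpatioTemporalPatchEmbed()(latents)
print(tokens.shape)  # torch.Size([2, 2048, 768])
```

Everything downstream of this step, attention over tokens, conditioning, and denoising, can stay as it was for images; the transformer simply sees a longer sequence that now spans space and time.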
For me, that’s the elegance of transformers. Once you understand how they work, you start to see how everything fits together across image, video, audio, and more. That’s exactly why I chose to write this book about transformers beyond language: the deeper insight is not in treating each domain separately, as is often done with other deep learning models, but in realizing how naturally the same architecture extends across them.
Hence, in this chapter I’ll take ...