Build a Text-to-Image Generator (from Scratch)

Name: Build a Text-to-Image Generator (from Scratch)
Author: MARK LIU
ISBN: 9781633435421

December 2025

Beginner to intermediate

360 pages

10h 48m

English

Manning Publications

Build a Text-to-Image Generator (from Scratch) With transformers and diffusions
copyright
contents
dedication
preface
acknowledgments
about this book
about the author
about the cover illustration
Part 1 Understanding attention and transformers

1 A tale of two models: Transformers and diffusions
1.1 What is a text-to-image generation model?
1.1.1 Unimodal vs. multimodal models
1.1.2 Practical use cases of text-to-image models
1.2 Transformer-based text-to-image generation
1.2.1 Converting an image into a sequence of integers and then back
1.2.2 Training and using a transformer-based text-to-image model
1.3 Text-to-image generation with diffusion models
1.3.1 Forward and reverse diffusions
1.3.2 Latent diffusion models and Stable Diffusion
1.4 How to build text-to-image models from scratch
1.5 Challenges for text-to-image generation models
1.5.1 Are generative AI models stealing from artists?
1.5.2 The geometric inconsistency problem
1.6 Social, environmental, and ethical concerns
2 Build a transformer
2.1 An overview of attention and transformers
2.1.1 How the attention mechanism works
2.1.2 How to create a transformer
2.2 Word embedding and positional encoding
2.2.1 Word tokenization with the Spacy library
2.2.2 A sequence padding function
2.2.3 Input embedding from word embedding and positional encoding
2.3 Creating an encoder–decoder transformer
2.3.1 Coding the attention mechanism
2.3.2 Defining the Transformer() class
2.3.3 Creating a language translator
2.4 Training and using the German-to-English translator
2.4.1 Training the encoder–decoder transformer
2.4.2 Translating German to English with the trained model
3 Classify images with a vision transformer
3.1 The blueprint to train a ViT
3.1.1 Converting images to sequences
3.1.2 Training a ViT for classification
3.2 The CIFAR-10 dataset
3.2.1 Downloading and visualizing CIFAR-10 images
3.2.2 Preparing datasets for training and testing
3.3 Building a ViT from scratch
3.3.1 Dividing images into patches
3.3.2 Modeling the positions of different patches in an image
3.3.3 Using the multi-head self-attention mechanism
3.3.4 Building an encoder-only transformer
3.3.5 Using the ViT to create a classifier
3.4 Training and using the ViT to classify images
3.4.1 Choosing the optimizer and the loss function
3.4.2 Training the ViT for image classification
3.4.3 Classifying images using the trained ViT
4 Add captions to images
4.1 Training and using a transformer to add captions
4.1.1 Preparing data and the causal attention mask
4.1.2 Creating and training a transformer
4.2 Preparing the training dataset
4.2.1 Downloading and visualizing Flickr 8k images
4.2.2 Building a vocabulary of tokens
4.2.3 Preparing the training dataset
4.3 Creating a multimodal transformer to add captions
4.3.1 Defining a ViT as the image encoder
4.3.2 Creating the decoder to generate text
4.4 Training and using the image-to-text transformer
4.4.1 Training the encoder–decoder transformer
4.4.2 Adding captions to images with the trained model
Part 2 Introduction to diffusion models
5 Generate images with diffusion models
5.1 The forward diffusion process
5.1.1 How diffusion models work
5.1.2 Visualizing the forward diffusion process
5.1.3 Different diffusion schedules
5.2 The reverse diffusion process
5.3 A blueprint to train the U-Net model
5.3.1 Steps in training a denoising U-Net model
5.3.2 Preprocessing the training data
5.4 Training and using the diffusion model
5.4.1 The Denoising Diffusion Probabilistic Model noise scheduler
5.4.2 Inference using the U-Net denoising model
5.4.3 Training and using the denoising U-Net model
6 Control what images to generate in diffusion models
6.1 Classifier-free guidance in diffusion models
6.1.1 An overview of classifier-free guidance
6.1.2 A blueprint to implement CFG
6.2 Different components of a denoising U-Net model
6.2.1 Time step embedding and label embedding
6.2.2 The U-Net denoising model architecture
6.2.3 Down blocks and up blocks in the U-Net
6.3 Building and training the denoising U-Net model
6.3.1 Building the denoising U-Net
6.3.2 The Denoising Diffusion Probabilistic Model
6.3.3 Training the diffusion model
6.4 Generating images with the trained diffusion model
6.4.1 Visualizing generated images
6.4.2 How the guidance parameter affects generated images
7 Generate high-resolution images with diffusion models
7.1 Attention in U-Net, DDIM, and image interpolation
7.1.1 Incorporating the attention mechanism in the U-Net model
7.1.2 Denoising Diffusion Implicit Models
7.1.3 Image interpolation in diffusion models
7.2 High-resolution flower images as training data
7.2.1 Visualizing images in the training dataset
7.2.2 Applying forward diffusion on flower images
7.3 Building and training a U-Net for high-resolution images
7.3.1 Building the denoising U-Net model
7.3.2 Training the denoising U-Net model
7.4 Image generation and interpolation
7.4.1 Using the trained denoising U-Net to generate images
7.4.2 Transition from one image to another
Part 3 Text-to-image generation with diffusion models
8 CLIP: A model to measure the similarity between image and text
8.1 The CLIP model
8.1.1 How the CLIP model works
8.1.2 Selecting an image from Flickr 8k based on a text description
8.2 Preparing the training dataset
8.2.1 Image-caption pairs in Flickr 8k
8.2.2 The DistilBERT tokenizer
8.2.3 Preprocess captions and images for training
8.3 Creating a CLIP model
8.3.1 Creating a text encoder
8.3.2 Creating an image encoder
8.3.3 Building a CLIP model
8.4 Training and using the CLIP model
8.4.1 Training the CLIP model
8.4.2 Using the trained CLIP model to select images
8.4.3 Using the OpenAI pretrained CLIP model to select images
9 Text-to-image generation with latent diffusion
9.1 What is a latent diffusion model?
9.1.1 How variational autoencoders work
9.1.2 Combining a latent diffusion model with a variational autoencoder
9.2 Compressing and reconstructing images with VAEs
9.2.1 Downloading the pretrained VAE
9.2.2 Encoding and decoding images with the pretrained VAE
9.3 Text-to-image generation with latent diffusion
9.3.1 Guidance by the CLIP model
9.3.2 Diffusion in the latent space
9.3.3 Converting latent images to high-resolution ones
9.4 Modifying existing images with text prompts
10 A deep dive into Stable Diffusion
10.1 Generating images with Stable Diffusion
10.2 The Stable Diffusion architecture
10.2.1 Generating images from text with Stable Diffusion
10.2.2 Text embedding interpolation
10.3 Creating text embeddings
10.4 Image generation in the latent space
10.5 Converting latent images to high-resolution ones
Part 4 Text-to-image generation with transformers
11 VQGAN: Convert images into sequences of integers
11.1 Converting images into sequences of integers and back
11.2 Variational autoencoders
11.2.1 What is an autoencoder?
11.2.2 The need for VAEs and their training methodology
11.3 Vector quantized variational autoencoders
11.3.1 The need for VQ-VAEs
11.3.2 The VQ-VAE model architecture and training process
11.4 Vector quantized generative adversarial networks
11.4.1 Generative adversarial networks
11.4.2 VQGAN: A GAN with a VQ-VAE generator
11.5 A pretrained VQGAN model
11.5.1 Reconstructing images with the pretrained VQGAN
11.5.2 Converting images into sequences of integers
12 A minimal implementation of DALL-E
12.1 How min-DALL-E works
12.1.1 Training min-DALL-E
12.1.2 From prompt to pixels: Image generation at inference time
12.2 Tokenizing and encoding the text prompt
12.2.1 Tokenizing the text prompt
12.2.2 Encoding the text prompt
12.3 Iterative prediction of image tokens
12.3.1 Loading the pretrained BART decoder
12.3.2 Predicting image tokens using the BART decoder
12.4 Converting image tokens to high-resolution images
12.4.1 Loading the pretrained VQGAN detokenizer
12.4.2 Visualizing the intermediate and final high-resolution outputs
Part 5 New developments and challenges
13 New developments and challenges in text-to-image generation
13.1 State-of-the-art text-to-image generators
13.1.1 DALL-E series
13.1.2 Google’s Imagen
13.1.3 Latent diffusion models: Stable Diffusion and Midjourney
13.2 Challenges and concerns
13.3 A blueprint to fine-tune ResNet50
13.3.1 The history and architecture of ResNet50
13.3.2 A plan to fine-tune ResNet50 for classification
13.3.3 Using ResNet50 to classify images
13.4 Fine-tuning ResNet50 to detect fake images
13.4.1 Downloading and preprocessing real and fake face images
13.4.2 Fine-tuning ResNet50
13.4.3 Detecting deepfakes using the fine-tuned ResNet50
appendix Installing PyTorch and enabling GPU training locally and in Colab
A.1 Installing Python and setting up a virtual environment
A.1.1 Installing Anaconda
A.1.2 Setting up a Python virtual environment
A.1.3 Installing Jupyter Notebook
A.2 Installing PyTorch
A.3 Using Google Colab for GPU training and inference
references

Overview

Build your own vision transformer and diffusion models for text-to-image generation, from scratch!

Build a Text-to-Image Generator (from Scratch) takes you step-by-step through creating your own AI models that can generate images from text. You’ll explore two methods of image generation—vision transformers and diffusion models—and learn vital AI development techniques as you go.

Build a Text-to-Image Generator (from Scratch) teaches you how to:

Build and train models to generate high resolution images based on text descriptions
Edit an existing image based on text prompts
Build and train a model to add captions to images
Build and train a vision transformer to classify images
Fine-tune LLMs for downstream tasks such as classification and text or image generation
Better differentiate real images from deepfakes

Build a Text-to-Image Generator (from Scratch) dives into the powerful models behind AI image generators. The best way to learn is to build something from scratch, and in this book you’ll build your very own diffusion model and vision transformer. As you work through each stage of development, you’ll develop an understanding of how these models can be customized, applied, and integrated for impressive multimodal AI.

About the Technology
AI-generated images appear everywhere from high-end advertising to casual social media feeds. Text-to-image tools like DALL-E, Midjourney, and Flux make it easy to create AI art, but how do they work? In this book, you’ll find out by building your own text-to-image generator!

About the Book
Build a Text-to-Image Generator (from Scratch) explores both transformer-based image generation and diffusion models. You’ll work hands-on to build a pair of simple generation models that can classify images, automatically add captions, reconstruct images, and enhance existing graphics. Author Mark Liu guides you every step of the way with clear explanations, informative diagrams, and eye-opening examples you can build on your own laptop.

What's Inside

Build a vision transformer to classify images
Edit images using text prompts
Fine-tune image models

About the Reader
Requires basic knowledge of generative AI models and intermediate Python skills.

About the Author
Mark Liu is the founding director of the Master of Science in Finance program at the University of Kentucky. He is also the author of Learn Generative AI with PyTorch.

Quotes
A practical and readable introduction with working code and clear explanations.
- Andrey Lukyanenko, Meta

Empowers you to unlock creativity at the intersection of text and imagery.
- Bojan Tunguz, Tabul.AI

Amazingly comprehensive, hype-free, hands-on, and code-rich guidebook.
- Kirk Borne, Data Leadership Group

Successfully brings together the theoretical foundations and practical applications, from transformers to diffusion models.
- Raymond Cheung, Parity Technologies
