Build a Text-to-Image Generator (from Scratch)

Name: Build a Text-to-Image Generator (from Scratch)
Author: MARK LIU
ISBN: 9781633435421

December 2025

Beginner to intermediate

360 pages

10h 48m

English

Manning Publications

Build a Text-to-Image Generator (from Scratch) With transformers and diffusions
copyright
contents
dedication
preface
acknowledgments
about this book
about the author
about the cover illustration
Part 1 Understanding attention and transformers

1 A tale of two models: Transformers and diffusions
1.1 What is a text-to-image generation model?
1.1.1 Unimodal vs. multimodal models
1.1.2 Practical use cases of text-to-image models
1.2 Transformer-based text-to-image generation
1.2.1 Converting an image into a sequence of integers and then back
1.2.2 Training and using a transformer-based text-to-image model
1.3 Text-to-image generation with diffusion models
1.3.1 Forward and reverse diffusions
1.3.2 Latent diffusion models and Stable Diffusion
1.4 How to build text-to-image models from scratch
1.5 Challenges for text-to-image generation models
1.5.1 Are generative AI models stealing from artists?
1.5.2 The geometric inconsistency problem
1.6 Social, environmental, and ethical concerns
2 Build a transformer
2.1 An overview of attention and transformers
2.1.1 How the attention mechanism works
2.1.2 How to create a transformer
2.2 Word embedding and positional encoding
2.2.1 Word tokenization with the Spacy library
2.2.2 A sequence padding function
2.2.3 Input embedding from word embedding and positional encoding
2.3 Creating an encoder–decoder transformer
2.3.1 Coding the attention mechanism
2.3.2 Defining the Transformer() class
2.3.3 Creating a language translator
2.4 Training and using the German-to-English translator
2.4.1 Training the encoder–decoder transformer
2.4.2 Translating German to English with the trained model
3 Classify images with a vision transformer
3.1 The blueprint to train a ViT
3.1.1 Converting images to sequences
3.1.2 Training a ViT for classification
3.2 The CIFAR-10 dataset
3.2.1 Downloading and visualizing CIFAR-10 images
3.2.2 Preparing datasets for training and testing
3.3 Building a ViT from scratch
3.3.1 Dividing images into patches
3.3.2 Modeling the positions of different patches in an image
3.3.3 Using the multi-head self-attention mechanism
3.3.4 Building an encoder-only transformer
3.3.5 Using the ViT to create a classifier
3.4 Training and using the ViT to classify images
3.4.1 Choosing the optimizer and the loss function
3.4.2 Training the ViT for image classification
3.4.3 Classifying images using the trained ViT
4 Add captions to images
4.1 Training and using a transformer to add captions
4.1.1 Preparing data and the causal attention mask
4.1.2 Creating and training a transformer
4.2 Preparing the training dataset
4.2.1 Downloading and visualizing Flickr 8k images
4.2.2 Building a vocabulary of tokens
4.2.3 Preparing the training dataset
4.3 Creating a multimodal transformer to add captions
4.3.1 Defining a ViT as the image encoder
4.3.2 Creating the decoder to generate text
4.4 Training and using the image-to-text transformer
4.4.1 Training the encoder–decoder transformer
4.4.2 Adding captions to images with the trained model
Part 2 Introduction to diffusion models
5 Generate images with diffusion models
5.1 The forward diffusion process
5.1.1 How diffusion models work
5.1.2 Visualizing the forward diffusion process
5.1.3 Different diffusion schedules
5.2 The reverse diffusion process
5.3 A blueprint to train the U-Net model
5.3.1 Steps in training a denoising U-Net model
5.3.2 Preprocessing the training data
5.4 Training and using the diffusion model
5.4.1 The Denoising Diffusion Probabilistic Model noise scheduler
5.4.2 Inference using the U-Net denoising model
5.4.3 Training and using the denoising U-Net model
6 Control what images to generate in diffusion models
6.1 Classifier-free guidance in diffusion models
6.1.1 An overview of classifier-free guidance
6.1.2 A blueprint to implement CFG
6.2 Different components of a denoising U-Net model
6.2.1 Time step embedding and label embedding
6.2.2 The U-Net denoising model architecture
6.2.3 Down blocks and up blocks in the U-Net
6.3 Building and training the denoising U-Net model
6.3.1 Building the denoising U-Net
6.3.2 The Denoising Diffusion Probabilistic Model
6.3.3 Training the diffusion model
6.4 Generating images with the trained diffusion model
6.4.1 Visualizing generated images
6.4.2 How the guidance parameter affects generated images
7 Generate high-resolution images with diffusion models
7.1 Attention in U-Net, DDIM, and image interpolation
7.1.1 Incorporating the attention mechanism in the U-Net model
7.1.2 Denoising Diffusion Implicit Models
7.1.3 Image interpolation in diffusion models
7.2 High-resolution flower images as training data
7.2.1 Visualizing images in the training dataset
7.2.2 Applying forward diffusion on flower images
7.3 Building and training a U-Net for high-resolution images
7.3.1 Building the denoising U-Net model
7.3.2 Training the denoising U-Net model
7.4 Image generation and interpolation
7.4.1 Using the trained denoising U-Net to generate images
7.4.2 Transition from one image to another
Part 3 Text-to-image generation with diffusion models
8 CLIP: A model to measure the similarity between image and text
8.1 The CLIP model
8.1.1 How the CLIP model works
8.1.2 Selecting an image from Flickr 8k based on a text description
8.2 Preparing the training dataset
8.2.1 Image-caption pairs in Flickr 8k
8.2.2 The DistilBERT tokenizer
8.2.3 Preprocess captions and images for training
8.3 Creating a CLIP model
8.3.1 Creating a text encoder
8.3.2 Creating an image encoder
8.3.3 Building a CLIP model
8.4 Training and using the CLIP model
8.4.1 Training the CLIP model
8.4.2 Using the trained CLIP model to select images
8.4.3 Using the OpenAI pretrained CLIP model to select images
9 Text-to-image generation with latent diffusion
9.1 What is a latent diffusion model?
9.1.1 How variational autoencoders work
9.1.2 Combining a latent diffusion model with a variational autoencoder
9.2 Compressing and reconstructing images with VAEs
9.2.1 Downloading the pretrained VAE
9.2.2 Encoding and decoding images with the pretrained VAE
9.3 Text-to-image generation with latent diffusion
9.3.1 Guidance by the CLIP model
9.3.2 Diffusion in the latent space
9.3.3 Converting latent images to high-resolution ones
9.4 Modifying existing images with text prompts
10 A deep dive into Stable Diffusion
10.1 Generating images with Stable Diffusion
10.2 The Stable Diffusion architecture
10.2.1 Generating images from text with Stable Diffusion
10.2.2 Text embedding interpolation
10.3 Creating text embeddings
10.4 Image generation in the latent space
10.5 Converting latent images to high-resolution ones
Part 4 Text-to-image generation with transformers
11 VQGAN: Convert images into sequences of integers
11.1 Converting images into sequences of integers and back
11.2 Variational autoencoders
11.2.1 What is an autoencoder?
11.2.2 The need for VAEs and their training methodology
11.3 Vector quantized variational autoencoders
11.3.1 The need for VQ-VAEs
11.3.2 The VQ-VAE model architecture and training process
11.4 Vector quantized generative adversarial networks
11.4.1 Generative adversarial networks
11.4.2 VQGAN: A GAN with a VQ-VAE generator
11.5 A pretrained VQGAN model
11.5.1 Reconstructing images with the pretrained VQGAN
11.5.2 Converting images into sequences of integers
12 A minimal implementation of DALL-E
12.1 How min-DALL-E works
12.1.1 Training min-DALL-E
12.1.2 From prompt to pixels: Image generation at inference time
12.2 Tokenizing and encoding the text prompt
12.2.1 Tokenizing the text prompt
12.2.2 Encoding the text prompt
12.3 Iterative prediction of image tokens
12.3.1 Loading the pretrained BART decoder
12.3.2 Predicting image tokens using the BART decoder
12.4 Converting image tokens to high-resolution images
12.4.1 Loading the pretrained VQGAN detokenizer
12.4.2 Visualizing the intermediate and final high-resolution outputs
Part 5 New developments and challenges
13 New developments and challenges in text-to-image generation
13.1 State-of-the-art text-to-image generators
13.1.1 DALL-E series
13.1.2 Google’s Imagen
13.1.3 Latent diffusion models: Stable Diffusion and Midjourney
13.2 Challenges and concerns
13.3 A blueprint to fine-tune ResNet50
13.3.1 The history and architecture of ResNet50
13.3.2 A plan to fine-tune ResNet50 for classification
13.3.3 Using ResNet50 to classify images
13.4 Fine-tuning ResNet50 to detect fake images
13.4.1 Downloading and preprocessing real and fake face images
13.4.2 Fine-tuning ResNet50
13.4.3 Detecting deepfakes using the fine-tuned ResNet50
appendix Installing PyTorch and enabling GPU training locally and in Colab
A.1 Installing Python and setting up a virtual environment
A.1.1 Installing Anaconda
A.1.2 Setting up a Python virtual environment
A.1.3 Installing Jupyter Notebook
A.2 Installing PyTorch
A.3 Using Google Colab for GPU training and inference
references

Content preview from Build a Text-to-Image Generator (from Scratch)

8 CLIP: A model to measure the similarity between image and text

This chapter covers

Compressing a text description and an image into the same latent space
Building and training a CLIP model to match text–image pairs
Measuring text–image similarity
Using the trained CLIP model to select an image based on a text prompt
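
As a rough illustration of the last two items, the sketch below scores a text embedding against several image embeddings with cosine similarity and picks the best match. It is a minimal, pure-Python sketch and not the book's implementation: the function names `cosine_similarity` and `select_image` and the three-dimensional toy vectors are hypothetical, and a real CLIP model would produce much higher-dimensional embeddings from trained text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_image(text_embedding, image_embeddings):
    """Return (index of best-matching image, list of all similarity scores)."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical toy embeddings that stand in for encoder outputs
# projected into a shared latent space.
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

best, scores = select_image(text_embedding, image_embeddings)
print(best, [round(s, 3) for s in scores])
```

Because both modalities are mapped into the same latent space, a single similarity measure is enough to rank images against a prompt; the second toy image points in nearly the same direction as the text vector, so it is the one selected.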

State-of-the-art text-to-image models such as DALL-E 2, Google’s Imagen, and Stable Diffusion are built on three foundational components: (1) a text encoder to convert language into a latent representation, (2) a mechanism for injecting text information into the image-generation process, and (3) a diffusion model to generate realistic images from noise.

In previous chapters, we explored how diffusion models generate images ...

