Chapter 1. An Introduction to Generative Media
Generative models have become widely popular in recent years. If you’re reading this book, you’ve probably interacted with a generative model at some point. Maybe you’ve used ChatGPT to generate text, used style transfer in apps like Instagram, or seen the deepfake videos that have been making headlines. These are all examples of generative models in action!
In this book, we’ll explore the world of generative models, starting with the basics of two families of generative models, transformers and diffusion, and working our way up to more advanced topics. We’ll cover the types of generative models, how they work, and how to use them. In this chapter, we’ll cover some of the history of how we got here and take a look at the capabilities offered by some of the models, which we’ll explore in more depth throughout the book.
So, what exactly is generative modeling? At its core, it’s about teaching a model to generate new data that resembles its training data. For example, if I train a model on a dataset of images of cats, I can then use that model to generate new images of cats that look like they could have come from the original dataset. This is a powerful idea, and it has a wide range of applications, from creating novel images and videos to generating text with a specific style.
Throughout this book, you’ll discover popular tools that make using existing generative models straightforward. The world of machine learning (ML) offers numerous open-access models, trained on large datasets, available for anyone to use. Training these models from scratch can be costly and time-consuming, but open-access models provide a practical and efficient alternative. These pretrained models can generate new data, classify existing data, and be adapted for new applications. One of the most popular places to find open-access models is Hugging Face, a platform with over two million models for many ML tasks, including image generation.
Generating Images
We'll kick off with diffusers, a popular open source library that provides access to state-of-the-art (SOTA) diffusion models. It's a powerful yet simple toolbox that lets us quickly load and train diffusion models.
By going to the Hugging Face Hub and filtering for models that generate images based on a text prompt (text-to-image), we can find some of the most popular models, such as Stable Diffusion and SDXL. We'll use Stable Diffusion 1.5, a diffusion model capable of generating high-quality images. If you browse the model website, you can read the model card, an essential document for discoverability and reproducibility. There, you can read about the model, how it was trained, its intended use cases, and more.
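If you prefer to do that filtering from code rather than in the browser, the huggingface_hub library exposes the same search programmatically. The snippet below is a small sketch (it assumes a recent version of huggingface_hub in which list_models accepts a task argument):

from huggingface_hub import list_models

# List the five most-downloaded text-to-image models on the Hub
for model in list_models(task="text-to-image", sort="downloads", limit=5):
    print(model.id)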
Given we have a model (Stable Diffusion) and a tool to use the model (diffusers), we can now generate our first image! When we load models, we need to send them to a specific hardware device, such as CPU (cpu), GPU (cuda or cuda:0), or Mac hardware called Metal (mps). The genaibook library we mentioned in the Preface has a utility function to select an appropriate device depending on where you run the example code. For example, the following code will assign cuda to the device variable if you have a GPU:
from genaibook.core import get_device

device = get_device()
print(f"Using device: {device}")
Using device: cuda
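If you're not using the genaibook helper, a minimal equivalent written directly against PyTorch could look like the following sketch (the MPS check only succeeds on Apple Silicon builds of PyTorch):

import torch

def pick_device() -> str:
    # Prefer an NVIDIA GPU, then Apple Metal, then fall back to CPU
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"Using device: {device}")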
Next, we'll load Stable Diffusion 1.5. The diffusers library offers a convenient, high-level wrapper called StableDiffusionPipeline, which is ideal for this use case. Don't worry about all the parameters for now; the highlights include the following:
- There are many models with the Stable Diffusion architecture, so we need to specify the one we want to use. We are going to use stable-diffusion-v1-5/stable-diffusion-v1-5, a mirror of the original Stable Diffusion 1.5 model released by RunwayML.
- We need to specify the precision we'll load the model with. Precision is something you'll learn more about later. At a high level, models are composed of many parameters (millions or billions of them). Each parameter is a number learned during training, and we can store these parameters with different levels of precision (in other words, we can use more bits to store the model). Higher precision allows the model to store more information, but it also requires more memory and computation. On the other hand, we can use a lower precision by setting torch_dtype=torch.float16 and use less memory than the default float32. When doing inference (a fancy way of saying "executing" the models), using float16 is usually fine.1 The short calculation after this list gives a sense of the memory savings.
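To get a feel for why precision matters, here is a back-of-the-envelope sketch using a hypothetical 1-billion-parameter model: each parameter takes 4 bytes in float32 and 2 bytes in float16, so halving the precision roughly halves the memory needed just to hold the weights.

# Rough weight-only memory estimate for a hypothetical 1B-parameter model
num_params = 1_000_000_000
for dtype, bytes_per_param in [("float32", 4), ("float16", 2)]:
    gigabytes = num_params * bytes_per_param / 1024**3
    print(f"{dtype}: ~{gigabytes:.1f} GB")

Real models add overhead on top of this (activations, intermediate buffers, and so on), but the ratio between the two precisions holds.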
The first time you run this code, it can take a bit: the pipeline downloads a model of multiple gigabytes, after all! If you load the pipeline a second time, it will redownload the model only if there has been a change in the remote repository that hosts the model on Hugging Face.2 Hugging Face libraries store the model locally in a cache, making things much faster for subsequent loads:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
Now that the model is loaded, we can define a prompt—the text input the model will receive. We can then pass the prompt through the model and generate our first image based on that text! Try inputting the following prompt:
prompt = "a photograph of an astronaut riding a horse"
pipe(prompt).images[0]
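The pipeline call returns an object whose images attribute is a list of PIL images. If you're running this in a script rather than a notebook, you can keep the result around by saving it to disk; a minimal sketch (the filename is just an example):

image = pipe(prompt).images[0]  # a PIL.Image.Image
image.save("astronaut_rides_horse.png")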
Exciting! With a couple of lines of code, we generated a new image. Play with the prompt and generate new images. You might notice two things. First, running the same code will generate different images each time. This is because the diffusion process is stochastic in nature, meaning it uses randomness to generate images. We can control this randomness by setting a seed:
import torch

torch.manual_seed(0)
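Setting the global seed works, but diffusers pipelines also accept a generator argument, which scopes the randomness to a single call instead of the whole program. A sketch building on the pipeline above (depending on your hardware, you may need to create the generator on the CPU rather than the accelerator):

# Reproducible generation without touching the global random state
generator = torch.Generator(device=device).manual_seed(0)
image = pipe(prompt, generator=generator).images[0]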
Second, the generated images are not perfect. They might have artifacts, be blurry, or not match the prompt at all. We’ll explore these limitations and how to improve the quality of the generated images in later chapters. For instance:
- Chapters 4 and 5 dive into all the components behind diffusion models and how to get from text to new images. They rely on foundational methods like autoencoders, introduced in Chapter 3, which can learn efficient representations from input data and reduce the compute requirements to build diffusion and other generative models.
- In Chapter 7, you'll learn how to teach new concepts to Stable Diffusion. For example, we can teach Stable Diffusion the concept of "my dog" to generate images of the author's dog in novel scenarios, such as "my dog visiting the moon".
- Chapter 8 shows how diffusion models can be used for more than just image generation, such as editing images with a prompt or filling in empty parts of an image.
Generating Text
Just as diffusers is a very convenient library for diffusion models, the popular transformers library is extremely useful for running transformer-based models and adapting them to new use cases. It provides a standardized interface for a wide range of tasks, such as generating text, detecting objects in images, and transcribing an audio file into text.
The transformers library provides different layers of abstraction. For example, if you don't care about all the internals, the easiest option is to use pipeline, which abstracts away all the processing required to get a prediction. We can instantiate a pipeline by calling the pipeline() function and specifying which task we want to solve, such as text-classification:
from transformers import pipeline

classifier = pipeline("text-classification", device=device)
classifier("This movie is disgustingly good!")
[{'label': 'POSITIVE', 'score': 0.9998536109924316}]
The model correctly predicted that the sentiment in the input text was positive. By default, the text-classification pipeline uses a sentiment analysis model under the hood, but we can also specify other transformer-based text-classification models.
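Specifying a model is just a matter of passing its Hub id to pipeline(). For instance, the following sketch loads a widely used sentiment checkpoint explicitly instead of relying on the default (any other text-classification model id from the Hub would work the same way):

# Explicitly choose which checkpoint backs the pipeline
classifier = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)
classifier("This movie is disgustingly good!")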
Similarly, we can switch the task to text generation (text-generation), with which we can generate new text based on an input prompt. By default, the pipeline uses the GPT-2 model. The pipeline also uses a default maximum number of tokens to generate, so don't be surprised if the output is truncated. You'll learn later how to change this:
from transformers import set_seed

# Setting the seed ensures we get the same results every time we run this code
set_seed(10)

generator = pipeline("text-generation", device=device)
prompt = "It was a dark and stormy"
generator(prompt)[0]["generated_text"]
It was a dark and stormy year, and my mind went blank," says the 27-year-old, who has become obsessed with art, poetry and music since moving to France. "I don't really know why, but there are things
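If you want longer outputs right away, generation arguments such as max_new_tokens can be passed directly to the pipeline call; a small sketch (60 is an arbitrary choice):

# Allow up to 60 newly generated tokens instead of the default limit
generator(prompt, max_new_tokens=60)[0]["generated_text"]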
Although GPT-2 is not a great model by today’s standards, it gives us an initial example of transformers’ generation capabilities while using a small model. The same concepts you learn about with GPT-2 can be applied to models such as Llama or Mistral, some of the most powerful open-access models (at the time of writing). Throughout the book, we’ll strike a balance between the quality and size of the models. Usually, larger models have higher-quality generations. At the same time, we want people with consumer computers or access to free services, such as Google Colab, to be able to create new generations by running code:
- Chapter 2 will teach you how transformer models work under the hood. We'll dive into different types of transformer models and how to use them for generating text.
- Chapter 6 will teach you how to continue training transformer models with our own data for different use cases. This will allow us to make conversational models like those you might have used with ChatGPT or Gemini. We'll also discuss efficient training approaches so that you can train transformer models on your own computer.
Generating Sound Clips
Generative models are not limited to images and text. Models can generate videos, short songs, synthetic speech, protein proposals, and more!
Chapter 9 dives deep into audio-related tasks that can be solved with ML, such as transcribing meetings and generating sound effects. For now, we can limit ourselves to the now-familiar transformers pipeline and use the small version of MusicGen, a model released by Meta to generate music conditioned on text:
pipe = pipeline("text-to-audio", model="facebook/musicgen-small", device=device)
data = pipe("electric rock solo, very intense")
print(data)
{'audio': array([[[0.12342193, 0.11794732, 0.14775363, ..., 0.0265964 , 0.02168683, 0.03067675]]], dtype=float32), 'sampling_rate': 32000}
Later, you'll learn how audio data is represented and what these numbers are. Of course, there's no way for us to print the audio file directly in the book! The best alternative is to show a player in our notebook or save the audio to a file we can play with our favorite audio application. For example, we can use the IPython.display module for this:
import IPython.display as ipd

display(ipd.Audio(data["audio"][0], rate=data["sampling_rate"]))
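To save the clip to disk instead, one option is scipy's WAV writer, sketched below (this assumes scipy is installed; MusicGen outputs float32 samples, which are written as a floating-point WAV file):

import numpy as np
import scipy.io.wavfile

# data["audio"] has shape (batch, channels, samples); keep the first clip
audio = np.squeeze(data["audio"][0])
scipy.io.wavfile.write("rock_solo.wav", data["sampling_rate"], audio)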
Ethical and Societal Implications
While generative models offer remarkable capabilities, their widespread adoption raises important considerations around ethics and societal impact. It’s important to keep them in mind as we explore the capabilities of generative models. Here are a few key areas to consider:
- Privacy and consent: The ability of generative models to generate realistic images and videos based on very little data poses significant challenges to privacy. For example, creating synthetic images from a small set of real images of an individual raises questions about using personal data without consent. It also increases the risk of creating deepfakes, which can be used to spread misinformation or harm individuals.
- Bias and fairness: Generative models are trained on large datasets that contain biases. These biases can be inherited and amplified by the generative models, as we'll explore in Chapter 2. For example, image-generation models trained on biased datasets may generate stereotypical or discriminatory images. It's important to consider how to mitigate these biases and to ensure that generative models are used fairly and ethically.
- Regulation: Given the potential risks associated with generative models, there is a growing call for regulatory oversight and accountability mechanisms to ensure responsible development. This includes transparency requirements, ethical guidelines, and legal frameworks to address the misuse of generative models.
It’s important to approach generative models with a thoughtful and ethical mindset. As we explore the capabilities of these models, we’ll also consider the ethical implications and how to use them responsibly.
Where We’ve Been and Where Things Stand
The research into and development of generative models began decades ago with efforts focused on rule-based systems. As computing power and data availability increased, generative models evolved to use statistical methods and ML. With the emergence of deep learning as a powerful paradigm in ML and breakthroughs in the fields of image and speech recognition, generative models have advanced significantly. Although invented decades ago, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have become widely popular in the last decade. CNNs revolutionized image-processing tasks, and RNNs brought sequential data-modeling capabilities, enabling tasks like text translation and text generation.
The introduction of Generative Adversarial Networks (GANs) by Ian Goodfellow in 2014, and variants such as Deep Convolutional GANs (DCGANs) and conditional GANs, brought a new era of generative models. GANs have been used to generate high-quality images and applied to tasks like style transfer, enabling users to apply artistic styles to their images with astonishing realism. Although quite powerful, the quality of GANs has been surpassed by diffusion models in recent years.
Similarly, although RNNs were the go-to tool for language modeling, transformer models, including architectures like GPT, achieved SOTA performance in Natural Language Processing (NLP). These models have demonstrated remarkable capabilities in tasks such as language understanding, text generation, and machine translation. GPT, in particular, became extremely popular because of its ability to generate coherent and contextually relevant text. Not long afterward, a huge wave of generative language models emerged.
The field of generative AI is more accessible than ever because of the rapid expansion of research, resources, and development in recent years. A growing community interested in the area, a rich open source ecosystem, and research facilitating deployment have led to a wide range of applications and use cases. Since 2023, a new generation of models that can generate high-quality images, text, code, videos, and more has emerged; examples include ChatGPT, DALL·E, Imagen, Stable Diffusion, Llama, Mistral, and many others.
How Are Generative AI Models Created?
Typically, the creation of AI models comes down to big budgets or open source.
Several of the most impressive generative models in the past couple of years were created by influential research labs in big, private companies. OpenAI developed ChatGPT, DALL·E, and Sora; Google built Imagen, Bard, and Gemini; and Meta created Llama and Code Llama.
There’s a varying degree of openness in the way these models are released. Some can be used via specific UIs, some have access through developer APIs, and some are just announced as research reports with no public access at all. In some cases, code and model weights are released as well: these are usually called open source releases because those are the essential artifacts necessary to run the model on your hardware. Frequently, however, they are kept hidden for strategic reasons.
At the same time, an ever-increasing, energetic, and enthusiastic community uses open source models as the clay for their creativity. All types of practitioners, including researchers, engineers, tinkerers, and amateurs, build on top of one another’s work and come up with novel solutions and clever ideas that push the field forward, one commit at a time. Some of these ideas make their way into the theoretical corpus of knowledge where researchers draw from, and new impressive models that use them come out after a while.
Big models, even when hidden, serve as inspiration for the community, whose work yields fruits that serve the field as a whole.
This cycle can work only because some of the models are open source and can be used by the community. Companies that release open source models don’t do it for altruistic reasons but because they discover economic value in this strategy. By providing code and models that are adopted by the community, they receive public scrutiny with bug fixes, new ideas, derived model architectures, or even new datasets that work well with the models released. Because all these contributions are based on the assets they published, these companies can quickly adopt them and thus move faster than they would on their own. When Meta released Llama, one of the most popular language models (LMs), a thriving ecosystem organically grew around it.
Established and new companies alike, including Meta, Stability AI (the company behind Stable Diffusion), and Mistral AI, have embraced varying degrees of open source as part of their business strategy. This is as legitimate as the strategy of competing companies that prefer to keep their trade secrets behind closed doors (even if those companies can also draw from the open source community).
At this point, we’d like to clarify that model releases are rarely truly open source. Unlike in the software world, source code is not enough to fully understand an ML system. Model weights are not enough either: they are just the final output of the model training process. Being able to exactly replicate an existing model would require the source code used to train the model (not just the modeling code or the inference code), the training regime and parameters, and, crucially, all the data used for training. None of these, and particularly the data, are usually released.
If there were access to these details, it would be possible for the community and the public to understand how the model works, explore the biases that may afflict it, and better assess its strengths and limitations. Access to the weights and model code provides an imperfect estimation of all this knowledge, but the actual hard data would be much better. On top of that, even when the models are publicly released, they often come out with a special license that does not adhere to the Open Source Initiative’s definition of open source. This is not to say that the models are not useful or that the companies are not doing a good thing by releasing them, but it’s an important context to keep in mind and one of the reasons we’ll often say open access instead of open source.
Be that as it may, there has never been a better time to build generative models or with generative models. You don’t need to be an engineer in a top-notch research lab to come up with ideas to solve the problems that interest you or to contribute to the field. We hope you find these pages helpful in your journey!
Summary
Hopefully, after generating your first images, text, and audio clips, you'll be excited to learn how diffusion and transformer models work under the hood, how to adapt them for new use cases, and how to use them for different creative applications. Although this chapter focused on high-level tools, we'll build solid foundations and intuition on how these models work as we embark on our generative journey. Let's go ahead and learn about the principles of generative models!
1 You might wonder about the variant parameter. For some models, you might find multiple checkpoints stored with different precisions. When specifying torch_dtype=torch.float16, we download the default model (stored in float32) and convert it to float16. By also specifying the fp16 variant, we download a smaller checkpoint already stored in float16 precision, which requires half the bandwidth and storage. Check the model you want to use to find out if there are multiple precision variants.
2 Hugging Face repositories are Git-based repositories under the hood.