Chapter 1. GPT-4 and ChatGPT Essentials

The ability to unlock the power of artificial intelligence has never been more accessible for developers. Large language models (LLMs) such as GPT-4 and GPT-3.5 Turbo have showcased their capabilities through ChatGPT. Now we find ourselves in a whirlwind of progress, with a pace that has never been seen before in the software world. OpenAI has made these technological innovations readily available; what transformative applications will you craft with the tools now at your disposal?

The implications of these AI models go far beyond chatbots. Thanks to LLMs, developers can now exploit the power of natural language processing (NLP) to create applications that understand users’ needs, transforming what once was science fiction into tangible reality. Moreover, thanks to the new vision capabilities of GPT-4, it is now possible to build software that can interpret and generate text based on images. From innovative customer support systems that learn and adapt to personalized educational tools that understand each student’s unique learning style, GPT language models open up a whole new world of possibilities.

But what are these GPT models? The goal of this chapter is to take a deep dive into their foundations, origins, and key features. By understanding the basics of these AI models, you will be well on your way to building the next generation of LLM-powered applications.

Introducing Large Language Models

This section lays down the fundamental building blocks that have shaped the development of GPT models. We aim to provide a comprehensive understanding of language models and NLP, the role of Transformer architectures, and the tokenization and prediction processes within these models. However, as we will see, this journey does not stop at text processing. The introduction of GPT-4 Vision marks an extension of the capabilities of LLMs beyond text to include the processing of multimodal input. This means that GPT-4 is not only good at text processing but can also interpret images.

Exploring the Foundations of Language Models and NLP

As LLMs, GPT models are among the latest types of models released in the field of NLP, which is itself a subfield of machine learning (ML) and AI. Before we delve into GPT models, it is essential to take a look at NLP and its related fields.

There are different definitions of AI, but the consensus, more or less, is that AI is the development of computer systems that can perform tasks that typically require human intelligence. With this definition, many algorithms fall under the AI umbrella. Consider, for example, the traffic prediction task in GPS applications, or the rule-based systems used in strategic video games. In these examples, as seen from the outside, the machine seems to require intelligence to accomplish these tasks.

ML, as mentioned, is a subset of AI. In ML, we do not try to directly implement the decision rules used by the AI system. Instead, we try to develop algorithms that allow the system to learn by itself. Since the 1950s, when ML research began, many ML algorithms have been proposed in the scientific literature.

Among them, deep learning algorithms have come to the fore. Deep learning is a branch of ML that focuses on algorithms inspired by the structure of the brain. These algorithms are called artificial neural networks. They can handle very large amounts of data and perform very well on tasks such as image and speech recognition and NLP.

The GPT models are based on the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. from Google. Transformers are like reading machines. They leverage an attention mechanism to prioritize different parts of the text, allowing for a deeper understanding of context and enabling coherent outputs. This approach enables them to grasp the meaning of words within sentences, improving their performance in language translation, question answering, and text generation. Figure 1-1 visually represents these core concepts and their role in enhancing the capabilities of transformer models for various language tasks.

Figure 1-1. A nested set of technologies from AI to transformers

NLP is a subfield of AI focused on enabling computers to process, interpret, and generate natural human language. Modern NLP solutions are based on ML algorithms. The goal of NLP is to allow computers to process natural language text. This goal covers a wide range of tasks:

Text classification
Categorizing input text into predefined groups. This includes, for example, sentiment analysis and topic categorization. Companies can use sentiment analysis to understand customers’ opinions about their services. Email filtering is an example of topic categorization in which email can be put into categories such as “Personal,” “Social,” “Promotions,” and “Spam.”
Automatic translation
Automatic translation of text from one language to another. Note that this can include areas such as translating code from one programming language to another, like from Python to C++.
Question answering
Answering questions based on a given text. For example, an online customer service portal could use an NLP model to answer FAQs about a product, or educational software could use NLP to provide answers to students’ questions about the topic being studied.
Text generation
Generating a coherent and relevant output text based on a given input text, called a prompt.

As mentioned earlier, LLMs are ML models that address text generation and other NLP tasks. LLMs enable computers to process, interpret, and generate human language, allowing for more effective human-machine communication. To do this, LLMs are trained on vast amounts of text data, from which they learn patterns and relationships between words in sentences. A variety of data sources can be used for this learning process, including text from Wikipedia, Reddit, archives of thousands of books, or even the archive of the internet itself. Given an input text, this training allows the LLM to predict the likeliest following words and, in this way, to generate meaningful responses to the input text. The LLM has a very large number of internal parameters, and as it learns, the training algorithm searches for the parameter values that allow the model to make the best possible predictions of the next words. Modern language models, such as the latest GPT models, are so large and have been trained on so much text that they can now directly perform most NLP tasks, such as text classification, machine translation, and question answering, among others.

Note

Different language models have been proposed by OpenAI. At the time of this writing, the latest and most capable ones are models from the GPT-4 series. In addition to its text processing capabilities, GPT-4 Vision also represents a significant advancement as a multimodal model, allowing it to process not only text but also images as input. LLMs are able to interpret images using a specialized Transformer architecture called vision transformer (ViT). More recently, the GPT-4o model goes further in multimodality: it can process and generate text, vision, and audio.

The development of LLMs goes back to the 1990s. It started with simple language models such as n-grams, which tried to predict the next word in a sentence based on the preceding words. N-gram models do this with word frequencies: the predicted next word is the word that most frequently follows the previous words in the text the n-gram model was trained on. While this approach was a good start, n-gram models’ limited grasp of context and grammar resulted in inconsistent text generation.
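
To make the idea concrete, here is a minimal sketch of a bigram (2-gram) model in Python. The tiny corpus and the greedy “most frequent next word” rule are illustrative only; real n-gram systems used much larger corpora and smoothing techniques.

    from collections import Counter, defaultdict

    # A toy training corpus for a bigram model (illustrative only).
    corpus = "the weather is nice today the weather is cold today".split()

    # Count how often each word follows each preceding word.
    next_word_counts = defaultdict(Counter)
    for previous, current in zip(corpus, corpus[1:]):
        next_word_counts[previous][current] += 1

    def predict_next(word):
        """Return the word that most frequently follows `word` in the corpus."""
        counts = next_word_counts[word]
        return counts.most_common(1)[0][0] if counts else None

    print(predict_next("weather"))  # 'is', because "weather" is always followed by "is"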

To improve on n-gram models, more advanced learning algorithms were introduced, including recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. These models could learn longer sequences and analyze context better than n-grams could, but they still struggled to process large amounts of data efficiently. These types of recurrent models were the most effective ones for a long time and therefore were the types most used in tools such as automatic machine translation.

Understanding the Transformer Architecture and Its Role in LLMs

The Transformer architecture revolutionized NLP, primarily because transformers effectively addressed one of the critical limitations of previous NLP models such as RNNs: the earlier models’ struggle with long input text sequences and with maintaining context over these lengths. In other words, while RNNs tended to forget the context in longer sequences, transformers came with the ability to handle and encode this context effectively.

The central pillar of this revolution is the attention mechanism, a simple yet powerful idea. Instead of treating all words in a text sequence as equally important, the model “pays attention” to the most relevant terms for each step of its task. This mechanism enables direct connections between distant elements in the text, such that the last word can “attend” to the first word without any constraints, overcoming a significant limitation faced by previous models such as RNNs. Cross-attention and self-attention are two architectural blocks based on this attention mechanism, and they are often found in LLMs. The Transformer architecture makes extensive use of these blocks.

Cross-attention helps the model determine the relevance of the different parts of the input text for accurately predicting the next word in the output text. It’s like a spotlight that shines on words or phrases in the input text, highlighting the relevant information needed to make the next word prediction while ignoring less important details.

To illustrate this, let’s take an example of a simple sentence translation task. Imagine we have an input English sentence, “Alice enjoyed the sunny weather in Brussels,” which should be translated into French as Alice a profité du temps ensoleillé à Bruxelles. In this example, let’s focus on generating the French word ensoleillé, which means sunny. For this prediction, cross-attention would give more weight to the English words sunny and weather since they are both relevant to the meaning of ensoleillé. By focusing on these two words, cross-attention helps the model generate an accurate translation for this part of the sentence. Figure 1-2 illustrates this example.

Figure 1-2. Cross-attention uses the attention mechanism to focus on essential parts of the input text (English sentence) in order to predict the next word in the output text (French sentence)

Self-attention refers to the ability of a model to focus on different parts of its input text. In the context of NLP, the model can evaluate the importance of each word in a sentence compared to the other words. This allows it to better understand the relationships between the words and helps the model build new concepts from multiple words in the input text.

As a more specific example, consider the following: “Alice received praise from her colleagues.” Assume that the model is trying to understand the meaning of the word her in the sentence. The self-attention mechanism assigns different weights to the words in the sentence, highlighting the words relevant to her in this context. In this example, self-attention would place more weight on the words Alice and colleagues. Self-attention helps the model build new concepts from these words. In this example, one of the concepts that could emerge would be “Alice’s colleagues,” as shown in Figure 1-3.

Figure 1-3. Self-attention allows the emergence of the “Alice’s colleagues” concept
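
To give a rough idea of what “assigning weights” means in practice, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the core of self-attention. The embeddings and projection matrices are random placeholders, so the printed attention pattern is meaningless; in a trained model, the row for her would concentrate its weight on Alice and colleagues.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "embeddings" for the tokens of "Alice received praise from her colleagues".
    tokens = ["Alice", "received", "praise", "from", "her", "colleagues"]
    d = 8                                  # a tiny embedding dimension, for illustration
    X = rng.normal(size=(len(tokens), d))  # one vector per token

    # In a real model, the query, key, and value projections are learned during training.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Scaled dot-product attention: every token attends to every other token.
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    output = weights @ V

    # The row for "her" shows how much weight it places on each token in the sentence.
    print(dict(zip(tokens, weights[tokens.index("her")].round(2))))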

Unlike the recurrent architecture, transformers also have the advantage of being easily parallelized. This means the Transformer architecture can process multiple parts of the input text simultaneously rather than sequentially. This allows for faster computation and training because different parts of the model can work in parallel without waiting for previous steps to complete, unlike in recurrent architectures, which require sequential processing. The parallel processing capability of transformer models fits perfectly with the architecture of graphics processing units (GPUs), which are designed to handle multiple computations simultaneously. Therefore, GPUs are ideal for training and running these transformer models because of their high parallelism and computational power. This advance allowed data scientists to train models on much larger datasets, paving the way for developing LLMs.

The Transformer architecture was originally developed for sequence-to-sequence tasks such as machine translation. A standard transformer consists of two primary components, an encoder and a decoder, both of which rely heavily on attention mechanisms. The task of the encoder is to process the input text, identify valuable features, and generate a meaningful representation of that text, known as an embedding. The decoder then uses this embedding to produce an output, such as a translation or summary. This output effectively interprets the encoded information.

Generative pre-trained transformers, commonly known as GPTs, are a family of models that are based on the Transformer architecture and that specifically utilize the decoder part of the original architecture. In GPT architecture, the encoder is not present, so there is no need for cross-attention to integrate the embeddings produced by an encoder. As a result, GPTs rely solely on the self-attention mechanism within the decoder to generate context-aware representations and predictions. Note that other well-known models, such as BERT (Bidirectional Encoder Representations from Transformers), are based on the encoder part. We don’t cover this type of model in this book. Figure 1-4 illustrates the evolution of these different models.

Figure 1-4. The evolution of NLP techniques from n-grams to the emergence of LLMs

Demystifying the Tokenization and Prediction Steps in GPT Models

LLMs receive a prompt as input, and in response they generate a text. This process is known as text completion. For example, the prompt could be The weather is nice today, so I decided to, and the model output might be go for a walk. You may be wondering how the LLM builds this output text from the input prompt. As you will see, it’s mostly just a question of probabilities.

When a prompt is sent to an LLM, it first breaks the input into smaller pieces called tokens. These tokens represent single words, parts of words, or spaces and punctuation. For example, the preceding prompt could be broken down like this: [“The”, “wea”, “ther”, “is”, “nice”, “today”, “,”, “so”, “I”, “de”, “ci”, “ded”, “to”]. Each language model comes with its own tokenizer. The tokenizer for the GPT-3.5 and GPT-4 series is available online on the OpenAI platform for testing purposes.
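
If you want to experiment with tokenization from Python rather than in the browser, the open source tiktoken library implements the tokenizers used by the GPT-3.5 and GPT-4 series. A minimal sketch (note that the actual split may differ slightly from the illustrative one above):

    # pip install tiktoken
    import tiktoken

    encoding = tiktoken.encoding_for_model("gpt-4")
    tokens = encoding.encode("The weather is nice today, so I decided to")

    print(len(tokens))                              # number of tokens in the prompt
    print(tokens)                                   # the integer token IDs
    print([encoding.decode([t]) for t in tokens])   # each token rendered back as text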

Tip

A rule of thumb for understanding tokens in terms of word length is that 100 tokens equal approximately 75 words for an English text. For other languages, this may not be true, and the number of tokens may be higher for the same number of words.

Thanks to the attention principle and the Transformer architecture, the LLM processes these tokens and can interpret the relationships between them and the overall meaning of the prompt. The Transformer architecture allows a model to efficiently identify the critical information and the context within the text.

To create a new sentence, the LLM predicts the tokens most likely to follow, based on the input context of the prompt provided by the user. OpenAI has produced many versions of GPT-4; at first, you had a choice between an input context window of 8,192 tokens and one of 32,768 tokens. As of early 2024, the latest models released by OpenAI are GPT-4 Turbo and GPT-4o, with a larger input context window of 128,000 tokens, equivalent to almost three hundred pages of English text. Unlike the previous recurrent models, which had difficulty handling long input sequences, the Transformer architecture with the attention mechanism allows the modern LLM to consider the context as a whole. Based on this context, the model assigns a probability score for each potential subsequent token. The token with the highest probability is then selected as the next token in the sequence. In our example, after The weather is nice today, so I decided to, the next best token could be go.

Note

As we’ll see in the next chapter, via a parameter called temperature, instead of always selecting the next token with the highest probability, it’s also possible to allow the model to choose the next token from among a set of the most probable tokens. This allows for variability and creativity in the model’s responses.

This process is then repeated, but now the context becomes The weather is nice today, so I decided to go, with the previously predicted token go added to the original prompt. The second token that the model predicts could be for. This process is repeated until a complete sentence is formed: The weather is nice today, so I decided to go for a walk. This process relies on the LLM’s ability to learn the next most probable word from massive text data. Figure 1-5 illustrates this process.

Figure 1-5. The completion process is iterative, token by token
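
The following toy sketch mimics this iterative, greedy selection. The hardcoded probability table is a stand-in for the next-token distribution that a real LLM computes with its Transformer layers.

    # A toy "model": for each context seen so far, a probability for each candidate next token.
    # In a real LLM these probabilities come from the Transformer, not from a lookup table.
    next_token_probs = {
        "The weather is nice today, so I decided to": {"go": 0.7, "stay": 0.2, "write": 0.1},
        "The weather is nice today, so I decided to go": {"for": 0.8, "to": 0.1, "out": 0.1},
        "The weather is nice today, so I decided to go for": {"a": 0.9, "the": 0.1},
        "The weather is nice today, so I decided to go for a": {"walk": 0.8, "run": 0.2},
    }

    text = "The weather is nice today, so I decided to"
    while text in next_token_probs:
        probs = next_token_probs[text]
        next_token = max(probs, key=probs.get)  # greedy: take the most probable token
        text = f"{text} {next_token}"           # append it and repeat with the new context

    print(text)  # The weather is nice today, so I decided to go for a walk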

Integrating Vision into an LLM

GPT-4 Vision introduces multimodality capabilities to the GPT-4 series, extending its utility beyond text. The specific mechanisms enabling this feature remain proprietary and undisclosed. However, insights can be drawn from open source LLMs that integrate visual data, providing a basis for understanding the potential methodologies employed by GPT-4 to achieve such multimodal functionality. This section delves into the processes observed in these open source counterparts to shed light on how image-text integration may be realized in GPT-4.

Convolutional neural networks (CNNs) have long been one of the state-of-the-art techniques for image processing tasks. CNNs are very good at tasks like image classification and object detection, which they achieve by using layers of filters that slide across an input image. These filters can maintain the spatial relationships between the pixels of the image. Thanks to this layer of filters, CNNs can recognize patterns ranging from simple edges in the early layers to complex shapes and objects in the deeper layers.

However, similar to how the introduction of Transformer architectures in 2017 revolutionized NLP by superseding RNNs, new models based on Transformer architectures were proposed for image processing in 2020. Since then, the long-standing dominance of CNNs in image processing has been challenged. In 2021, a Google paper called “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy et al. showed that a pure transformer model called vision transformer (ViT) can perform better than CNNs for many image classification tasks.

You may be wondering how the transformer processes image data. From a high-level point of view, it’s quite similar to what is done with text. As we saw before, when a prompt is sent to an LLM, the LLM first breaks the text into smaller pieces called tokens and then processes those tokens to predict the next token. In the case of an image, the ViT first splits the image into fixed-size patches. Figure 1-6 gives an example of this process.

Figure 1-6. An image is split into fixed-size patches before being injected into the transformer

These image patches are then integrated with the text tokens in a unified input sequence. Without going into too much technical detail, when an LLM processes text data, all the tokens are first projected into a high-dimensional space. In other words, each token is transformed into a high-dimensional vector, and this mapping function between the tokens and the high-dimensional vectors is computed during the learning process of the LLM. The process is almost the same for the fixed-size patches of the image: a mapping function between the patches and the same high-dimensional space is calculated during the learning process. In this way, tokens and patches are all placed in the same high-dimensional space, and the combined sequence of text and image can then be processed by the Transformer architecture to predict the next token. Because the visual patches and the textual tokens share the same high-dimensional representation space, the model can apply self-attention mechanisms across the two modalities and generate responses that consider both text and image information.

For a Python developer, this ability to process images can greatly impact how users interact with your AI application, for example, through more intuitive chatbots or educational tools that can understand and explain content from images.
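
Here is a minimal NumPy sketch of the patching and projection step just described. The image, the 16x16 patch size, and the random projection matrix are all placeholders; in a real vision transformer the projection is learned during training.

    import numpy as np

    rng = np.random.default_rng(0)

    image = rng.random((224, 224, 3))  # a dummy 224x224 RGB image
    patch_size = 16                    # ViT-style 16x16 patches
    d_model = 768                      # dimension of the shared embedding space

    # Split the image into non-overlapping 16x16 patches and flatten each one.
    patches = np.stack([
        image[i:i + patch_size, j:j + patch_size].reshape(-1)
        for i in range(0, image.shape[0], patch_size)
        for j in range(0, image.shape[1], patch_size)
    ])

    # A linear projection maps each flattened patch into the same space as the text tokens.
    projection = rng.normal(size=(patch_size * patch_size * 3, d_model))
    patch_embeddings = patches @ projection

    print(patch_embeddings.shape)  # (196, 768): one vector per patch, like one vector per token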

A Brief History: From GPT-1 to GPT-4

This section reviews the evolution of the OpenAI GPT models from GPT-1 to GPT-4.

GPT-1

In mid-2018, just one year after the invention of the Transformer architecture, OpenAI published a paper titled “Improving Language Understanding by Generative Pre-training”, by Radford et al., in which the company introduced the Generative Pre-trained Transformer, also known as GPT-1.

Before GPT-1, the common approach to building high-performance NLP neural models relied on supervised learning techniques, which use large amounts of manually labeled data. For example, in a sentiment analysis task where the goal is to classify whether a given text has positive or negative sentiment, a common strategy would require collecting thousands of manually labeled text examples to build an effective classification model. However, the need for large amounts of well-annotated, supervised data has limited the performance of these techniques because such datasets are both difficult and expensive to generate.

In their paper, the creators of GPT-1 proposed a new learning process in which an unsupervised pre-training step is introduced. In this pre-training step, no labeled data is needed. Instead, the model is trained to predict what the next token is. Thanks to the use of the Transformer architecture, which allows parallelization, this pre-training was performed on a large amount of data. For the pre-training, the GPT-1 model used the BookCorpus dataset, which contains the text of approximately 11,000 unpublished books. This dataset was first presented in 2015 in the scientific paper “Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books” by Zhu et al. and was initially made available on a University of Toronto web page. However, the official version of the original dataset is no longer publicly accessible.

The GPT-1 model was found to be effective in a variety of basic completion tasks. In the unsupervised learning phase, the model learned to predict the next item in the texts of the BookCorpus dataset. However, since GPT-1 is a small model, it was unable to perform complex tasks without fine-tuning. Therefore, fine-tuning was performed as a second supervised learning step on a small set of manually labeled data to adapt the model to a specific target task. For example, in a classification task such as sentiment analysis, it may be necessary to retrain the model on a small set of manually labeled text examples to achieve reasonable accuracy. This process allowed the parameters learned in the initial pre-training phase to be modified to better fit the task at hand.

Despite its relatively small size, GPT-1 showed remarkable performance on several NLP tasks using only a small amount of manually labeled data for fine-tuning. The GPT-1 architecture consisted of a decoder similar to that of the original transformer introduced in 2017, and it had 117 million parameters. This first GPT model paved the way for more powerful models with larger datasets and more parameters that could take better advantage of the potential of the Transformer architecture.

GPT-2

In early 2019, OpenAI proposed GPT-2, a scaled-up version of the GPT-1 model that increased the number of parameters and the size of the training dataset tenfold. The number of parameters in this new version was 1.5 billion, trained on 40 GB of text. In November 2019, OpenAI released the full version of the GPT-2 language model.

Note

GPT-2 is publicly available and can be downloaded from Hugging Face or GitHub.

GPT-2 showed that training a larger language model on a larger dataset improves its ability to process tasks and allows it to outperform the state of the art on many of them. It also showed that larger language models process natural language better.

GPT-3

OpenAI released version 3 of GPT in June 2020. The main differences between GPT-2 and GPT-3 are the size of the model and the quantity of data used for training. GPT-3 is a much larger model than GPT-2, with 175 billion parameters, allowing it to capture more complex patterns. In addition, GPT-3 was trained on a more extensive dataset. This included Common Crawl, a large web archive containing text from billions of web pages, as well as other sources such as Wikipedia. This training dataset, which includes content from websites, books, and articles, allowed GPT-3 to develop a deeper understanding of language and context. As a result, GPT-3 demonstrated improved performance on a variety of linguistic tasks. It also demonstrated superior coherence and creativity in its generated texts. It was even capable of writing code snippets, such as SQL queries, and performing other intelligent tasks. Furthermore, GPT-3 eliminated the need for the fine-tuning step that was mandatory for its predecessors.

From GPT-3 to InstructGPT

However, with GPT-3 there was a problem of misalignment between the tasks given by end users and what the model had seen during its training. As we’ve said, language models are trained to predict the next token based on the input context. This training process is not necessarily directly aligned with the tasks that end users want the model to perform. In addition, increasing the size of language models does not inherently make them better at following user intent or instructions. Moreover, models like GPT-3 were trained on data from different sources on the internet. Although some cleaning was done when the sources were selected, the training data may contain false or problematic text, including racist or sexist text, misinformation, or disinformation. As a result, the model may sometimes say incorrect or even toxic things.

In 2021 a new release of the GPT-3 model called the Instruct series was published. Unlike the original GPT-3 base model, the Instruct models are optimized by reinforcement learning from human feedback (RLHF), meaning that they use feedback from humans to learn and to improve over time. This allows the models to learn from human instructions while making them more truthful and less toxic.

To illustrate the difference, we input the prompt Explain what is meant by time complexity, and we receive the following:

  • With the standard GPT-3 base, we obtain this output: Explain what is meant by space complexity. Explain what is meant by the big-O notation.

  • With the InstructGPT-3 model, we obtain: Time complexity is a way of measuring the amount of time it takes for an algorithm to run and complete its task. It is usually expressed using Big O notation, which measures the complexity of an algorithm in terms of the number of operations it performs. The time complexity of an algorithm is important because it determines how efficient the algorithm is and how well it scales with larger inputs.

You can see that for the same input, the first model cannot answer the question (its answer is even strange), whereas the second model does answer it. Users expect to interact with an assistant, which is not something that the standard GPT-3 base model does well, so additional training with RLHF is used to achieve the desired behavior. It is, of course, possible to obtain the desired response from a standard GPT-3 base model. However, unlike with the Instruct models, it is necessary to apply specific prompt design and optimization techniques to obtain the desired output from the GPT-3 model. This technique, called prompt engineering, will be detailed in the coming chapters.

OpenAI explains how the Instruct series was constructed in the scientific paper “Training Language Models to Follow Instructions with Human Feedback” by Ouyang et al.

The training recipe has two main stages to go from a GPT-3 model to an instructed GPT-3 model: supervised fine-tuning (SFT) and RLHF. In each stage, the results of the prior stage are fine-tuned. That is, the SFT stage receives the GPT-3 model and returns a new model, which is sent to the RLHF stage to obtain the instructed version.

Figure 1-7, adapted from the scientific paper from OpenAI, details the entire process.

Figure 1-7. The steps to obtain the instructed models (redrawn from an image by Ouyang et al.)

We will step through these stages one by one.

In the SFT stage, the original GPT-3 model is fine-tuned with straightforward supervised learning (step 1 in Figure 1-7). OpenAI has a collection of prompts made by end users. The process starts with the random selection of a prompt from the set of available prompts. A human (called a labeler) is then asked to write an example of an ideal answer to this prompt. This process is repeated thousands of times to obtain a supervised training set composed of prompts and the corresponding ideal responses. This dataset is then used to fine-tune the GPT-3 model to give more consistent answers to user requests. The resulting model is called the SFT model.

The RLHF stage is divided into two substeps. First, a reward model (RM) is built (step 2 in Figure 1-7), and then the RM is used for reinforcement learning (step 3 in Figure 1-7).

The goal of the RM is to automatically give a score to a response to a prompt. When the response matches what is indicated in the prompt, the RM score should be high; when it doesn’t match, it should be low. To construct the RM, OpenAI begins by randomly selecting a prompt and using the SFT model to produce several possible answers. (As we will see later, it is possible to produce many responses to the same input prompt via a parameter called temperature.) A human labeler is then asked to rank the responses based on criteria such as fit with the prompt and toxicity of the response. After this procedure has been run many times, the resulting dataset is used to fine-tune the SFT model for scoring. This RM will be used to build the final InstructGPT model.

The final step in training InstructGPT models involves reinforcement learning, an iterative process. It starts with an initial generative model, such as the SFT model. Then a random prompt is selected, and the model predicts an output, which the RM evaluates. Based on the reward received, the generative model is updated accordingly. This process can be repeated countless times without human intervention, providing a more efficient and automated approach to adapting the model for better performance.

InstructGPT models are better at producing accurate completions for what people give as input in the prompt. OpenAI recommends using the InstructGPT series rather than the original series.

GPT-3.5, ChatGPT, Codex

In March 2022, OpenAI made available new versions of GPT-3. These new models could edit text or insert content into text. They were trained on data through June 2021 and were described as more powerful than previous versions. At the end of November 2022, OpenAI began referring to these models as belonging to the GPT-3.5 series.

In November 2022, OpenAI also introduced ChatGPT as an experimental conversational tool. The model behind this tool was a fine-tuned version of GPT-3.5 called GPT-3.5 Turbo. This model excelled at interactive dialogue, using a technique similar to that shown in Figure 1-7, but for chat.

Note

When ChatGPT was first launched, the same name referred to both the model behind the tool and the chatbot web interface that used it, so ChatGPT could designate either the model or the web interface.

In the rest of this book, we will make this distinction:

  • GPT-3.5 and GPT-4 refer to two families of OpenAI’s large language models, each family having several versions of a model.

  • ChatGPT refers to the chat web interface that uses these models. ChatGPT runs by default with a GPT-3.5 model.

OpenAI also proposed the Codex model, a GPT-3 model fine-tuned on billions of lines of code that powered the first version of GitHub Copilot, an autocompletion programming tool that assists developers in many text editors, including Visual Studio Code, JetBrains IDEs, and even Neovim. However, the Codex model was deprecated by OpenAI in March 2023. Instead, OpenAI recommends that users switch from Codex to GPT-3.5 Turbo or GPT-4. At the same time, GitHub released Copilot X, which is based on GPT-4 and provides much more functionality than the previous version.

Warning

OpenAI’s deprecation of the Codex model serves as a stark reminder of the inherent risk of working with APIs: they can be subject to changes or discontinuation over time as newer, more efficient models are developed and rolled out.

GPT-4

In March 2023, OpenAI made GPT-4 available. We know very little about the architecture of this model, as OpenAI has provided few details. It is OpenAI’s most advanced system to date and should produce safer and more useful answers. The company claims that GPT-4 surpasses GPT-3.5 Turbo in its advanced reasoning capabilities.

Note

When the model was released, OpenAI published a technical report assessing the model’s capabilities that included lots of comparisons with previous OpenAI models such as InstructGPT and GPT-3.

Unlike the other models in the OpenAI GPT family, GPT-4 is the first multimodal model capable of receiving not only text but also images. This means that GPT-4 considers both the images and the text in the context that the model uses to generate an output sentence, which makes it possible to add an image to a prompt and ask questions about it.

Initially, OpenAI did not make this feature publicly available in GPT-4. It was in November 2023 that OpenAI announced the GPT-4 Turbo model with vision capabilities. This new model also came with a new 128,000-token context window. This means a single input prompt can be equivalent to about 300 pages of English text! Moreover, this GPT-4 Turbo model is also cheaper than the original GPT-4.

In the example shown in Figure 1-8, we wrote an equation on a piece of paper, took a picture of it, and asked GPT-4 Turbo to describe the equation in the picture. As you can see, the model easily recognized that it was the golden ratio.

Figure 1-8. The visual capabilities of GPT-4 in action (February 2024)
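
You can reproduce this kind of interaction from Python through the OpenAI API (covered in the next chapter). The sketch below assumes the openai v1.x client, an OPENAI_API_KEY environment variable, a vision-capable model name, and a placeholder image URL:

    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4-turbo",  # any vision-capable model you have access to
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the equation in this picture."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/equation.jpg"}},  # placeholder URL
            ],
        }],
    )
    print(response.choices[0].message.content)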

There are now many models on the market, and it is becoming necessary to compare them objectively to determine which one performs better at what tasks. One way to do this is simply to assess their results on university exams. It’s in this context that the models have also been evaluated on various tests, and GPT-4 has outperformed GPT-3.5 Turbo by scoring in higher percentiles among the test takers. For example, on the Uniform Bar Exam, GPT-3.5 Turbo scored in the 10th percentile, whereas GPT-4 scored in the 90th percentile. Similar results were obtained in the International Biology Olympiad, in which GPT-3.5 Turbo scored in the 31st percentile and GPT-4 in the 99th percentile. This progress is very impressive, especially considering that it was achieved in less than one year. Recently, OpenAI released GPT-4o (“o” for “omni”), its latest flagship model; this model seems to outperform the previous GPT-4 model on various benchmarks.

Another popular way to compare language models is to ask humans to rate different interactions with different models blindly, so that they don’t know which model they are talking to. The LMSYS Chatbot Arena Leaderboard, hosted on Hugging Face, provides such comparisons. LMSYS Chatbot Arena is a crowdsourced, randomized battle platform for LLMs. On this platform, users can talk to two randomly selected models at the same time, without knowing which model they’re talking to, and then vote on which of the two responses they find most relevant. It’s like a competitive game, with tournaments, and something called the ELO score is used to rate the models (see “Why Is the ELO Used to Compare the Models?” for more).

Table 1-1 summarizes the evolution of the GPT models.

Table 1-1. Evolution of the GPT models
2017 The paper “Attention Is All You Need” by Vaswani et al. is published.
2018 The first GPT model is introduced with 117 million parameters.
2019 The GPT-2 model is introduced with 1.5 billion parameters.
2020 The GPT-3 model is introduced with 175 billion parameters.
2022 The GPT-3.5 (ChatGPT) model is introduced with 175 billion parameters.
2023 The GPT-4 model is introduced, but the number of parameters is not disclosed.
2024 The GPT-4o model is introduced.

Note

You may have heard the term foundation model before. Unlike traditional models trained for specific tasks, foundation models are trained on a wide variety of data. This extensive training gives these models a deep understanding of different domains—and that knowledge can then be fine-tuned to enable the models to perform specific tasks. The GPT models are foundation models. As we’ve seen, they demonstrate a remarkable ability to generate humanlike text across various topics. Through fine-tuning, the broad knowledge of GPT can be specialized to excel at various tasks ranging from writing articles to programming. This allows foundation models to adapt to tasks in healthcare, finance, and more by exploiting their vast, domain-agnostic knowledge base.

The Evolution of AI Toward Multimodality

As we’ve mentioned, transformers and language models were historically dedicated to text processing tasks only. The first Transformer architecture, proposed by Vaswani et al. in the 2017 paper “Attention Is All You Need”, addressed the problem of text translation. It wasn’t long before transformer-based technologies were applied to other types of data. GPT-4 already has vision capabilities that allow the model to consider an image in its input context when generating a response to a given input prompt. But these aren’t the only modalities you can use in your applications. OpenAI provides other tools you can use with Python. These tools, accessible via the OpenAI API, are not embedded within the LLMs themselves but serve as complementary technologies that you, as a developer, can leverage to imbue your applications with a broader spectrum of AI functionalities.

Image generation with DALL-E

Through the OpenAI API, your applications can call the DALL-E 2 or DALL-E 3 models directly. These models are text-to-image models. DALL-E 3, the more advanced iteration, has the capacity to incorporate text within images and supports both landscape and portrait orientations. The images generated by DALL-E 3 are, in general, significantly more attractive and detailed than those of DALL-E 2. DALL-E 3 can also understand significantly more complex prompts. These models offer developers the ability to create visually compelling content directly from textual descriptions, opening new avenues for creative and practical applications.
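
A minimal sketch of calling DALL-E 3 from Python through the OpenAI API; the model name, prompt, and size parameter reflect the API at the time of writing and should be treated as assumptions:

    from openai import OpenAI

    client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

    response = client.images.generate(
        model="dall-e-3",
        prompt="A watercolor illustration of a lighthouse at sunset, in portrait orientation",
        size="1024x1792",  # DALL-E 3 supports square, landscape, and portrait sizes
        n=1,
    )
    print(response.data[0].url)  # a temporary URL pointing to the generated image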

Voice recognition and synthesis

OpenAI has also trained a transformer-based neural network called Whisper. Whisper excels at speech recognition in over 50 languages, with particular strength in English, where it achieves near-human capabilities. OpenAI has open sourced Whisper’s code, but as a developer, you also have access to this tool via the OpenAI API. Whisper enables developers to create applications capable of understanding spoken language with remarkable accuracy.

In parallel, the OpenAI audio API provides access to two text-to-speech models. One is optimized for real-time text-to-speech use cases, and the other is optimized for quality. You have a choice among six voices, and although the models perform best in English, they support more than 50 languages.
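
The two directions, speech to text and text to speech, look roughly like this through the API; the file names and the voice choice are placeholders:

    from openai import OpenAI

    client = OpenAI()

    # Speech to text with Whisper.
    with open("meeting.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    print(transcript.text)

    # Text to speech with one of the six available voices.
    speech = client.audio.speech.create(
        model="tts-1",   # optimized for real-time use; "tts-1-hd" targets quality
        voice="alloy",
        input="Hello! This sentence was generated from text.",
    )
    speech.stream_to_file("hello.mp3")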

Video generation with Sora

At the time of writing, OpenAI has announced its new text-to-video tool. Sora is not yet available to developers, but it is an indication of what will be available in the near future. With a simple prompt, the tool is supposed to be capable of generating videos up to 60 seconds long.

All these extensions of AI capabilities with multimodality tools open new horizons for you as a developer and for your applications. With these technologies, you can create more interactive applications that engage users across multiple modalities: image, voice, text . . . and soon video.

LLM Use Cases and Example Products

OpenAI includes many inspiring customer stories on its website. This section explores some of these applications, use cases, and product examples. We will discover how these models may transform our society and open new opportunities for business and creativity. As you will see, many businesses already use these new technologies, but there is room for more ideas. It is now up to you.

Be My Eyes

Since 2012, Be My Eyes has created technologies for a community of several million people who are blind or have limited vision. For example, it has an app that connects volunteers with blind or visually impaired persons who need help with everyday tasks, such as identifying a product or navigating within an airport. With only one click in the app, the person who needs help is contacted by a volunteer who, through video and microphone sharing, can help the person.

The new multimodal capacity of GPT-4 makes it possible to process both text and images, so Be My Eyes began developing a new virtual volunteer based on GPT-4. This new virtual volunteer aims to reach the same level of assistance and understanding as a human volunteer.

“The implications for global accessibility are profound,” says Michael Buckley, CEO of Be My Eyes. “In the not-so-distant future, the blind and low-vision community will utilize these tools not only for a host of visual interpretation needs but also to have a greater degree of independence in their lives.”

At the time of this writing, the Be My Eyes AI assistant is still in open beta. It is available for iOS users and is rolling out to Android users.

Morgan Stanley

Morgan Stanley is a multinational investment bank and financial services company in the United States. As a leader in wealth management, Morgan Stanley has a content library of hundreds of thousands of pages of knowledge and insight covering investment strategies, market research and commentary, and analyst opinions. This vast amount of information is spread across multiple internal sites and is mostly in PDF format. This means consultants must search a large number of documents to find answers to their questions. As you can imagine, this search can be long and tedious.

The company evaluated how it could leverage its intellectual capital with GPT’s integrated research capabilities. The resulting internally developed model will power a chatbot that performs a comprehensive search of wealth management content, efficiently unlocking Morgan Stanley’s accumulated knowledge. In this way, GPT-4 has provided a way to analyze all this information in a format that is much easier to use.

Khan Academy

Khan Academy is a US-based nonprofit educational organization founded in 2008 by Sal Khan. Its mission is to create a set of free online tools to help educate students worldwide. The organization offers thousands of math, science, and social studies lessons for students of all ages. In addition, the organization produces short lessons through videos and blogs, and recently it began offering Khanmigo, a new AI assistant powered by GPT-4.

Khanmigo can do a lot of things for students, such as guiding and encouraging them, asking them questions, and preparing them for tests. Khanmigo is designed to be a friendly chatbot that helps students with their classwork. It does not give students answers directly but instead guides them in the learning process. Khanmigo can also support teachers by helping them make lesson plans, complete administrative tasks, and create lesson books, among other things.

“We think GPT-4 is opening up new frontiers in education. A lot of people have dreamed about this kind of technology for a long time. It’s transformative, and we plan to proceed responsibly with testing to explore if it can be used effectively for learning and teaching,” says Kristen DiCerbo, chief learning officer at Khan Academy.

Duolingo

Duolingo is a US-based educational technology company, founded in 2011, that produces applications used by millions of people who want to learn a second language. Duolingo users need to understand the rules of grammar to learn the basics of a language. And they need to have conversations, ideally with a native speaker, to understand those grammar rules and master the language. This is not possible for everyone.

Duolingo has added two new features to the product using OpenAI’s GPT-4: Roleplay and Explain My Answer. These features are available in a new subscription level called Duolingo Max. With these features, Duolingo has bridged the gap between theoretical knowledge and the practical application of language. Thanks to LLMs, Duolingo allows learners to immerse themselves in real-world scenarios.

The Roleplay feature simulates conversations with native speakers, allowing users to practice their language skills in a variety of settings. The Explain My Answer feature provides personalized feedback on grammar errors, facilitating a deeper understanding of the structure of the language.

“We wanted AI-powered features that were deeply integrated into the app and leveraged the gamified aspect of Duolingo that our learners love,” says Edwin Bodge, principal product manager at Duolingo.

The integration of GPT-4 into Duolingo Max not only enhances the overall learning experience but also paves the way for more effective language acquisition, especially for those without access to native speakers or immersive environments. This innovative approach should transform the way learners master a second language and contribute to better long-term learning outcomes.

Yabble

Yabble is a market research company that uses AI to analyze consumer data in order to deliver actionable insights to businesses. Its platform transforms raw, unstructured data into visualizations, enabling businesses to make informed decisions based on customer needs.

The integration of advanced AI technologies such as GPT into Yabble’s platform has enhanced its consumer data-processing capabilities. This enhancement allows for a more effective understanding of complex questions and answers, enabling businesses to gain deeper insights based on the data. As a result, organizations can make more informed decisions by identifying key areas for improvement based on customer feedback.

“We knew that if we wanted to expand our existing offers, we needed artificial intelligence to do a lot of the heavy lifting so that we could spend our time and creative energy elsewhere. OpenAI fit the bill perfectly,” says Ben Roe, Head of Product at Yabble.

Waymark

Waymark provides a platform for creating video ads. This platform uses AI to help businesses easily create high-quality videos without the need for technical skills or expensive equipment.

Waymark has integrated GPT into its platform, which has significantly improved the scripting process for platform users. This GPT-powered enhancement allows the platform to generate custom scripts for businesses in seconds. This enables users to focus more on their primary goals, as they spend less time editing scripts and more time creating video ads. The integration of GPT into Waymark’s platform therefore provides a more efficient and personalized video creation experience.

“I’ve tried every AI-powered product available over the last five years but found nothing that could effectively summarize a business’s online footprint, let alone write effective marketing copy, until GPT-3,” says Waymark founder Nathan Labenz.

Inworld AI

Inworld AI provides a developer platform for creating AI characters with distinct personalities, multimodal expression, and contextual awareness.

One of the main use cases of the Inworld AI platform is video games. The integration of GPT as the basis for the character engine of Inworld AI enables efficient and rapid video game character development. By combining GPT with other ML models, the platform can generate unique personalities, emotions, memories, and behaviors for AI characters. This process allows game developers to focus on storytelling and other topics without having to invest significant time in creating language models from scratch.

“With GPT-3, we had more time and creative energy to invest in our proprietary technology that powers the next generation of non-player characters (NPCs),” says Kylan Gibbs, chief product officer and cofounder of Inworld.

Beware of AI Hallucinations: Limitations and Considerations

As you have seen, an LLM generates an answer by predicting the next words (or tokens) one by one based on a given input prompt. In most situations, the model’s output is relevant and entirely usable for your task, but it is essential to be careful when you are using language models in your applications, because they can give incoherent answers. These answers are often referred to as hallucinations. An AI hallucination occurs when AI gives you a confident response that is false or that refers to imaginary facts. This can be dangerous for users who rely on GPT. You need to double-check and critically examine the model’s response.

Consider the following example. We start by asking the model to do a simple calculation: 2 + 2. As expected, it answers 4. So it is correct. Excellent! We then ask it to do a more complex calculation: 3,695 × 123,548. Although the correct answer is 456,509,860, the model gives a wrong answer with great confidence, as you can see in Figure 1-9. And when we ask it to check and recalculate, it still gives a wrong answer.

Figure 1-9. ChatGPT hallucinating bad math (April 22, 2023)

Although, as we will see, you can add new features to ChatGPT using a plug-in system, ChatGPT does not include a calculator by default. To answer our question of what is 2 + 2, ChatGPT generates each token one at a time. It answers correctly because it probably has often seen “2 + 2 equals 4” in the texts used for its training. It doesn’t really do the calculation—it’s just text completion.

Warning

It is likely that the GPT-3.5 model running behind ChatGPT has seldom, if at all, seen the numbers we chose for the multiplication problem (3,695 and 123,548) in its training. This is why it makes a mistake. The model hallucinates. And as you can see, even when it makes a mistake, it can seem reasonably sure about its wrong output. So be careful if you use the model in one of your applications. If GPT makes mistakes, your application may get inconsistent results. Note that math errors are only one type of hallucination.

Notice that ChatGPT’s result is close to the correct answer and not completely random. It is an interesting side effect of its algorithm that, even though it has no mathematical capabilities, it can give a close estimation with a language approach only.
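
The practical consequence is simple: anything the model presents as a computed fact should be verified outside the model, for example with one line of ordinary Python:

    # Do the arithmetic yourself instead of trusting the completion.
    print(3_695 * 123_548)  # 456509860, the correct product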

Note

OpenAI introduced the ability to use plug-ins with GPT-4. As we will see in Chapter 5, these tools allow you to add additional functionality to the LLM. One such tool is a calculator that helps GPT correctly answer these types of questions.

In the preceding example, ChatGPT made an unintentional mistake. But in some cases, it can be deliberately deceitful, as demonstrated in Figure 1-10.

Figure 1-10. Asking ChatGPT to count zebras in a Wikipedia picture (ChatGPT, May 2023)

ChatGPT begins by claiming that it cannot access the internet. However, if we insist, something interesting happens—see Figure 1-11.

Figure 1-11. ChatGPT claiming it accessed the Wikipedia link (May 2023)

ChatGPT now implies that it did access the link. However, this is definitely not possible at the moment. ChatGPT is blatantly leading the user to think that it has capabilities it doesn’t have. By the way, as Figure 1-12 shows, there are more than three zebras in the image.

Figure 1-12. The zebras ChatGPT didn’t really count

Warning

ChatGPT and the GPT models behind it are, by design, not reliable; they can make mistakes, give false information, or even mislead the user. We highly recommend using pure GPT-based solutions only for creative applications and not for question answering where the truth matters, such as in medical tools. For such use cases, as you will see, plug-ins are probably an ideal solution.

Unlocking GPT Potential with Advanced Features

In addition to the completion feature of the language models provided by OpenAI, more advanced techniques can be used to further exploit their capabilities. This book looks at some of these methods:

  • Plug-ins

  • Prompt engineering

  • Retrieval-augmented generation (RAG)

  • Fine-tuning

  • GPTs and the Assistants API

A GPT model has some limitations, for example, with calculations. As you’ve seen, a GPT model can correctly answer simple math problems like 2 + 2 but may struggle with more complex calculations like 3,695 × 123,548.

In the web interface of ChatGPT, the plug-in service provided by OpenAI allows the model to be connected to applications that may be developed by third parties. These plug-ins enable the language models running in the ChatGPT interface to interact with developer-defined APIs, which can greatly extend the capabilities of the GPT models by giving them access to the outside world through a wide range of actions.

Note

OpenAI offers a paid subscription to ChatGPT users called ChatGPT Plus. With this subscription, ChatGPT comes with three additional tools: web browsing, DALL-E image generation, and code interpreter, as well as the possibility to change from GPT-3.5 to GPT-4.

Plug-ins for ChatGPT can allow GPT models to do many things, such as the following:

  • Retrieve real-time information such as sports scores, stock quotes, the latest news, and more

  • Perform actions on behalf of the user, such as booking a flight, ordering food, and so on

  • Execute accurate math calculations (see the sketch after this list for how a similar calculator tool can be wired up via the API)

  • Retrieve knowledge-based information such as corporate documents, personal notes, and more

These are just a few examples of use cases; it is up to you to find new ones.
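
Plug-ins themselves are configured through the ChatGPT interface rather than written as inline code, but the OpenAI API exposes a related mechanism, function (tool) calling, that gives a feel for how a calculator capability like the one above can be wired up. The sketch below is an assumption-laden illustration (model name, tool schema), not the plug-in mechanism itself; plug-ins are covered in Chapter 5.

    import json
    from openai import OpenAI

    client = OpenAI()

    # Describe a calculator "tool" the model may call instead of guessing at arithmetic.
    tools = [{
        "type": "function",
        "function": {
            "name": "multiply",
            "description": "Multiply two numbers exactly",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                "required": ["a", "b"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4-turbo",  # an assumption; any tool-capable model works
        messages=[{"role": "user", "content": "What is 3695 multiplied by 123548?"}],
        tools=tools,
    )

    # The model replies with a structured tool call; our own code does the exact arithmetic.
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(args["a"] * args["b"])  # 456509860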

The last example in the preceding list involves adding a knowledge base to the model, which can then be used in the input context to enable the language model to answer questions that are specific to the end user’s needs. This use case, called retrieval-augmented generation (RAG), is becoming increasingly popular today. We will detail RAG concepts in Chapter 4 and give implementation examples in Chapters 3 and 5.

This book also examines fine-tuning techniques. As you will see, fine-tuning can improve the accuracy of an existing model for a specific task. Fine-tuning involves retraining a model on a specific dataset to optimize its performance in a downstream task. This process refines the model’s internal weights, enhancing its ability to capture task-specific subtleties and improving its effectiveness in the desired context. For example, a model fine-tuned on a financial corpus would demonstrate superior abilities to interpret financial discourse and generate relevant content, as its recalibrated parameters would be better aligned with the semantic and syntactic patterns of financial language.
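
As a preview, starting a fine-tuning job through the API looks roughly like this; the training file name, its contents, and the base model are placeholders:

    from openai import OpenAI

    client = OpenAI()

    # Upload a JSONL file of example conversations, then start a fine-tuning job on it.
    training_file = client.files.create(
        file=open("financial_examples.jsonl", "rb"),  # placeholder dataset
        purpose="fine-tune",
    )
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",  # the base model to fine-tune
    )
    print(job.id, job.status)  # poll the job until it produces a fine-tuned model name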

OpenAI has introduced the concept of what it calls (perhaps confusingly) GPTs in the ChatGPT web interface tool. This type of GPT can be thought of as a kind of wrapper that combines a language model (which may be fine-tuned), various tools, and some documentation that the model can use, plus specific instructions that explain the role and task of the language model. It brings these elements together to create an AI agent, which is then specialized to perform a particular task.

For example, you could imagine developing a GPT for sports nutrition advice. To do this, you could give the GPT different tools, such as an algorithm that can calculate personalized meal plans based on user input. The documentation that you provide to the GPT could, for instance, cover topics such as the importance of macronutrients and micronutrients for athletes. Finally, the specific instructions for the model could explain in text that the model should provide accurate and personalized sports nutrition advice to the end user. This is just an example of what you can do with a GPT. And one more important thing to add: GPTs run in the ChatGPT web interface, and you don’t need coding skills to create them. GPTs open up a new era of customizable AI agents without requiring a lot of IT skills.

Note

The word GPT now has two different meanings. Either it refers to the transformer-based models described in the paper “Improving Language Understanding by Generative Pre-training,” published by OpenAI in 2018, or it refers to the customizations just mentioned. The context should allow you to determine the intended meaning.

The term plug-ins has also evolved in the context of the OpenAI ecosystem. Initially, OpenAI explored the concept of plug-ins to extend the capabilities of models in ChatGPT by integrating third-party services. However, in more recent documentation about GPTs, the term refers to actions. Actions introduce many new features while retaining many of the core ideas of plug-ins.

The Assistants API allows you to build AI assistants within your own applications. There are a lot of similarities with the GPTs just described. Like GPTs, an assistant uses a language model (which may be fine-tuned), has instructions, and can leverage tools and knowledge to respond to user queries. This integration creates a seamless ecosystem in which you, as a developer, can tailor AI capabilities to fit the unique needs of your end users. The Assistants API differs from GPTs in that GPTs are designed to be called from the ChatGPT web interface—you can’t call your GPTs in your Python application via APIs. However, with the Assistants API, developers gain the ability to directly integrate these AI assistants into their applications, offering a more customized and interactive user experience. In short, GPTs allow for a high degree of customization without coding skills, whereas the Assistants API requires more technical skills for integration into your application.
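
To make the comparison concrete, here is a minimal sketch of the Assistants API (in beta at the time of writing); the assistant’s name, instructions, and model are illustrative:

    from openai import OpenAI

    client = OpenAI()

    # An assistant bundles a model with instructions and the tools it may use.
    assistant = client.beta.assistants.create(
        name="Sports nutrition advisor",
        instructions="Give accurate, personalized sports nutrition advice.",
        model="gpt-4-turbo",                   # an assumption; use any supported model
        tools=[{"type": "code_interpreter"}],  # lets the assistant run calculations
    )

    # Conversations live in threads; a run executes the assistant on a thread.
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content="How much protein should a 70 kg runner eat per day?",
    )
    run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
    print(run.status)  # poll until "completed", then read the messages in the thread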

For developers, plug-ins and GPTs potentially open up many new opportunities. In the future, every company may want to have its own plug-in or GPTs for LLMs. OpenAI has opened a store to share thousands of GPTs developed by the partners of OpenAI and by the community. OpenAI also says that a GPT Builder revenue program will be launched in the first half of 2024, to be available in the US first. No further information on this revenue program is available at the time of writing. The number of applications that could be added via plug-ins or GPTs could be enormous.

Summary

LLMs have come a long way, starting with simple n-gram models and moving to RNNs, LSTMs, and advanced transformer-based architectures. LLMs are computer programs that can process and generate humanlike language, trained with ML techniques on vast amounts of text data. By using self-attention and cross-attention mechanisms, transformers have greatly enhanced language understanding.

This book explores how to use GPT models, as they offer advanced capabilities for understanding context and generating humanlike text. Building applications with them goes beyond what traditional BERT or LSTM models can offer, providing truly humanlike interactions.

Since early 2023, GPT models have demonstrated remarkable capabilities in NLP. As a result, they have contributed to the rapid advancement of AI-enabled applications in various industries. Different use cases already exist, ranging from applications such as Be My Eyes to platforms such as Waymark, which are testaments to the potential of these models to revolutionize how we interact with technology.

It is important to keep in mind the potential risks of using these LLMs. As a developer of applications that will use the OpenAI API, you should be sure that users know the risk of errors and can verify the AI-generated information.

The next chapter will give you the tools and information to use the OpenAI models available as a service so that you can be part of this incredible transformation we are experiencing today.
