Chapter 1. Introduction: What Is It Like to Be a Language Model?

What is it like to be a bat?

The philosopher Thomas Nagel asks this question in his 1974 essay on consciousness.1 His position is that the answer is unknowable. If I imagine that I have webbed arms and poor vision, perceive the world by sonar, subsist on a diet of insects, and spend the day hanging upside down, “it tells me only what it would be like for me to behave as a bat behaves.” But if I try to imagine what it’s like for a bat to be a bat, I am restricted to the limited range of my own mind and experiences, which renders this impossible.

Humans and bats, at least at the time of this writing, have no shared language. On the other hand, countless AI models exist in the world—many of which have been created specifically to communicate something to us in our own language.

The recent explosion of advances in machine learning has brought a myriad of interesting, powerful, and increasingly opaque models. Simultaneously, the recent movement toward democratization of AI has lowered the barriers to becoming a data scientist and to using machine learning models. It is simple to deploy a model in the real world without being concerned about explaining its output or exploring the ethical implications of decisions that will be made on the basis of that output. The ease of creating and using machine learning models is going up; the ease of understanding what machine learning models are doing is going down (Figure 1-1).

Figure 1-1. Recent advancements in the field of machine learning have meant that creating and running models is becoming much easier, while simultaneously, understanding what models are doing is becoming harder.

Why Do You Need to Read This Report?

We use a myriad of technological advances every day without wanting or needing to understand the details of how they work. You may not know exactly how your car, your toaster, or even your house was put together, but you have at least a general intuition of motion, electricity, and architecture. You (mostly) do not anthropomorphize them to the extent of accusing your toaster of being biased toward burning your bread, or suspecting your car’s air conditioner of deliberately refusing to work on especially hot days. When you observe such effects, you know to change your toaster settings or to take your car in for a checkup.

The human race is in a much earlier stage of its relationship with AI than it is with cars and appliances. We use AI every day without wanting or needing to understand the details of how it works—search, communication, vision, automation, the list goes on. However, we have not yet settled into a comfortable intuition of the general functionality. Unlike with toasters, we struggle to keep from anthropomorphizing AI—and language models (LMs) are one of the more challenging examples of this, as language output is made to be interpreted.

This report is not going to teach, or even get into, how to run and deploy language models. Many wonderful practical resources are available for this, at various levels of granularity.2 You may have never personally run a language model (or a system with a language model component). Alternatively, you may be creating, training, and running language models on a regular basis, or consuming them in downstream applications. Wherever you fall in that range, we suppose the following:

  • You have at least some surface familiarity with probability, matrices, and machine learning tasks such as classification, summarization, and translation (you are not reading these terms for the first time here). You likewise have at least heard of language models, though you don’t need to be clear on what exactly they are or how they work.
  • You want to be able to take a step back from the logistical details and have a thoughtful conversation with the various stakeholders in your business about high-level language model concepts (as the primary focus of this report is on neural network language models).
  • You want to have the same comfortable handle on the models you come in contact with as you do with toasters: you know that electricity heats the filaments and that the generated heat toasts the bread, but not the atomic-level physics and chemistry involved in either process.
  • And finally, you want to have reasonable confidence in how to think critically about applying language models to your business. (You can toast a waffle but not a slice of cheese—or rather, you technically can, but you will have a mess on your hands!)

Let’s begin by turning a critical eye to what a language model is, what it does, and how it approaches language.

What Is a Language Model?

Language models are at the root of many of our experiences with technology. Among many, many examples, you are interacting with a language model when:

  • You read the summary of a large set of product reviews.
  • You ping the HR chatbot to find out this year’s company holidays.
  • Your phone offers weirdly prescient next-word suggestions.
  • And for us, when we enter “language” in our Google search box and the top autocomplete suggestion is “language model”—because we have been working on this report!

A language model is a technique for calculating the probability of a particular sequence of words occurring. It tries to emulate certain human linguistic capabilities, learning a myriad of associations between words to represent, depending on the task at hand, your own language patterns, the patterns of a set of other people, the patterns of English in general, and so on.

But remember, a human will not know what it is like to be a bat through imitating bat behavior. Likewise, an LM will not know what it is like to be a human through imitating human interaction with language. An LM learns only what it would be like if it behaved like a human, within the model’s capabilities. And since this report is written for a human audience (not for LMs), we will spend the rest of the report providing some perspective on how humans should think about what an LM is doing when it is trying to behave like a human.

What Does a Language Model Do?

So what do LMs do? Consider a simple game between two people, popular in improv comedy. One person begins by throwing out a single word, the second person gives the next word to follow, the first person gives the third word, and so on: “I,” “went,” “to,” “the,” “mall,” “and,” “found,” “a,” “stupendous,” “llama,” “that,” “was,” “purple.” The object is often to see who is first to laugh or to find themselves unable to continue the game. The direction the generated story takes depends on the background knowledge of the players, their previous experience with the game, their goals (to keep going for as long as possible? To cause the other person to laugh?), and many other subtle and variable aspects.

In the simplest terms, each time a trained language model is invoked, it’s playing the next round of this game. The prompt may be something like a part of a sentence to be continued (“I went to the”), or something to be translated (speech or text in a different language). The object is to output the best possible word(s) for the given prompt, and what the model chooses as the best output depends on how it gets trained.

The model is trained by playing the game of “predict the next word,” where all the best guesses are already known (the sentences or translations already exist). In each round, the model tries to guess the next word, compares its guess against the known answer, and tweaks its internal representation to get closer to the existing text.
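
If you are curious what one round of this game looks like in code, the following is a minimal sketch. It assumes the Hugging Face transformers library and its small, publicly available gpt2 model, neither of which this report depends on; any trained language model could play the same round.

```python
from transformers import pipeline

# Load a small, publicly available trained model purely for illustration.
generator = pipeline("text-generation", model="gpt2")

# One round of the game: ask the model to continue the prompt by one token.
result = generator("I went to the", max_new_tokens=1, do_sample=False)
print(result[0]["generated_text"])
```

Setting do_sample=False makes the model always pick its single most likely continuation; allowing sampling is what gives the improv game its more surprising turns.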

Note

A language model (LM) is a technique for calculating the probability of a particular sequence of words occurring. In plain terms, the model is always playing the game of “predict the next word.”

Are Language Models like Humans?

To understand how an LM approaches language, let’s first consider how people think about language.

A fluent English speaker expects to come across “red book” rather than “book red,” and “hot dog” rather than “cold mouse,” and “early bird” rather than “late bird.” When asked why this is, a human may possibly be able to articulate the reasons for these examples: in English, adjectives usually precede nouns, and a “hot dog” is a phrase describing a popular food and not just a synonym for “overheated canine,” and an “early bird” is a description of early risers or early arrivers, as well as part of a well-known proverb. However, a human’s simplest answer would be that you “just see” the common phrases, and “just don’t see” the uncommon ones. This hodgepodge of human responses draws on the two components of language: structure and meaning.

However, LMs are aware of only structure. They are entirely focused on outputting something grammatically correct—without the notion that such a thing as grammar exists. A good language model (of English) will also correctly rank “red book” as more likely than “book red,” and so on, because it will have “read” enough English text to learn what is common and expected.3

But the model can never know what “hot dog” actually means. For that matter, the model does not know what “hot” and “dog” mean separately, nor is it able to grasp the concept of words as having meaning. In the same way, an LM cannot intentionally tell lies, obfuscate facts, or spare your feelings. When we encounter language, we experience the illusion of meaning because we must. LMs do not because they cannot.

Note

Language models don’t make judgments; they make predictions. Language models do not mean what they say. Language models generate well-formed language, and humans experience it as an illusion of meaning.

How Does a Language Model Learn?

Language models (and AI models more generally) are constructed through the framing of “becoming as good as possible” at a specific task. An LM “understands” a task only in terms of the content it encountered while training for that task: for input that is this, the output is that. To generalize the statement further: for input that is like this, the desired output is like that.

The representation of all the necessary and sufficient information to calculate like is usually referred to as the feature space. Perfecting the calculation for weighing this information in preparation for unseen input is the process of learning, or training. This discussion is a simplification, but it applies to everything else discussed in this report as well as to artificial intelligence more broadly.

Note

The process of creating a trained model is the journey

from

for input that is this, the output is that

to

for input that is like this, the desired output is like that

by

representing all the necessary and sufficient information to calculate like.

As we described, the most straightforward task for an LM is learning to predict the next word: for an input sequence of K words, the output is the word W that the language model deems the most likely to come next.4 The model is exposed to training data comprising many input and output examples.5 (Examples just from the preceding sentence include the input “model” with the output “is,” the input “exposed to” with the output “training,” and so on.) For more advanced LMs, the input is more complicated than the few words preceding the output, incorporating additional contextual information from the rest of the sentence or from even further away.
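
To make those (input, output) pairs concrete, here is a minimal sketch; it is a toy illustration rather than how any particular LM is implemented, and it arbitrarily uses the two preceding words as the input.

```python
# Turn a sentence into "predict the next word" training examples:
# the input is the preceding words, the output is the word that follows.
sentence = "the model is exposed to training data".split()

K = 2  # toy choice: use the two preceding words as the input
examples = []
for i in range(K, len(sentence)):
    context = sentence[i - K:i]  # the K words before position i
    target = sentence[i]         # the word the model should learn to predict
    examples.append((context, target))

for context, target in examples:
    print(f"input: {' '.join(context):<15} output: {target}")
```

Each printed pair is one of the flash cards described next: the prompt on one side, the expected answer on the other.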

During this learning process, the model constructs its representation to “understand” the language of the training data, and adjusts that representation until it comes as close as it can to matching the examples in the training data. Think of your own experience training for a test by using flash cards, with a prompt on one side and the answer on the other. You check your mastery of the material by how closely your answer to a prompt matches the other side of the card. The language model is learning from many (many!) such flash cards. But what exactly is this representation that is being learned?

How Does an LM Represent Language?

Language models use three general approaches to encode the information that represents language:

Linguistic

Grammar structures are manually encoded into the model, which is difficult and time-consuming to do, but is also explicit and fully explainable. The model represents language as a set of rules. One such rule for English would be that an adjective usually comes before the noun it is modifying (“red book,” not “book red”), though there are exceptions, such as adjectives that follow linking verbs (“the cake tastes great,” not “the great cake tastes” or “the cake great tastes”).

Probabilistic/statistical

The model, using a reference text, counts word occurrences, and relies on those counts when playing “predict the next word.” As an overly simple example, a bigram (two-word) language model counts five instances of the word “red” in the reference text, of which two are “red book” and three are “red shoes.” When asked to predict the next word after a prompt of “red,” the model will answer “book” with two-fifths probability and “shoes” with three-fifths probability. An LM can learn these probabilities for a word sequence of any size (one word, two words, three words, etc.). Additional statistical tricks can address issues such as dealing with previously unseen sequences. The model represents language as a set of word sequences and their associated probabilities. Compared to modern neural network language models, statistical language models are both far better understood and, at the current time, waning in popularity. (A short code sketch of this counting appears just after this list of approaches.)

Embeddings

The model represents every word in the language as a vector in a high-dimensional space. This is the representation primarily used by deep neural network language models.
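
Before we spend more time on embeddings, the bigram example above is small enough to check in code. Here is a minimal sketch; the reference text is invented so that “red” is followed by “book” twice and “shoes” three times, matching the example.

```python
from collections import Counter, defaultdict

# Toy reference text: "red" is followed by "book" twice and "shoes" three times.
reference_text = "red book red shoes red shoes a red book on red shoes".split()

# Count, for each word, which words follow it (a bigram count).
next_word_counts = defaultdict(Counter)
for first, second in zip(reference_text, reference_text[1:]):
    next_word_counts[first][second] += 1

# Turn the counts after "red" into "predict the next word" probabilities.
counts = next_word_counts["red"]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word} | red) = {count}/{total} = {count / total:.2f}")
```

Running this prints the two-fifths and three-fifths probabilities from the example; a real statistical LM does the same kind of counting over vastly more text and longer sequences, plus the statistical tricks mentioned above for unseen sequences.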

The third approach, embeddings, is both the most difficult one to understand and the one used by most state-of-the-art language models, so we will now spend a little extra time with it.

Consider the following (well-known) analogy: man:woman::king:?6 Or, “man” is to “woman” as “king” is to what word? With reasonable fluency in English, you will have little difficulty coming up with the answer of “queen.” You might explain your reasoning in a couple of ways:

  • The definition of “king” is (simplistically) “royal man.” What word is defined as “royal woman”? The answer is “queen.”
  • The difference between “man” and “woman” is a change in gender from “male” to “female” (again, simplistically, and fully acknowledging the existence of nuances we are ignoring for the sake of the example). What do you get when you take “king” and change its gender from “male” to “female”? The answer is “queen.”

In your mind, you have a concept of the difference between “man” and “woman” as being like the difference between “king” and “queen”; similarly, “man” and “king” are different, like “woman” and “queen” are different. We have words to represent these differences: “gender” and “royalty.” Furthermore, adding any other words does not change the relationships: a “tall man” is like a “tall king” in the exact same way that a “tall woman” is like a “tall queen”—that is to say, the difference between them is still only the concept of “royalty.”

Taking a critical eye to the previous sentences, it is clear that we as humans define all these words in a circular way, relative to each other (all words are, of course, defined by other words). We hold some representation of these words in our mind, and when probed, use other words to describe the meaning of that representation—the concepts of royalty, gender, and so on.

LMs, as we have discussed, do not grasp the idea of meaning. But in the space of word embeddings, they have a crisp and quantifiable idea of differences and of objects being like other objects:

A word

A point in a multidimensional embedding space, with specific coordinates

A difference between two words

The distance between their coordinates, represented by a vector (which has a size and direction)

Determining to what extent a relationship between one pair of words (“man” → “woman”) is like the one between a second pair of words (“king” → “queen”) is equivalent to calculating how similar the differences (size and direction of movement in the embedding space) are between the words of the first pair and the words of the second.

Figure 1-2 illustrates how an LM would see the four words of our analogy as points in an embedding space.

Figure 1-2. A simplified representation of the four words “man,” “woman,” “king,” and “queen” in an LM embedding space, with the dotted lines representing the differences between them. A real LM has far more than three dimensions (which is simply impossible to visualize in a helpful way).

We can now loosely translate our human rationales into two potential paths through the embedding space that arrive at the answer to the analogy (a code sketch of the arithmetic follows the list):

  • Start at “king.” Move to “man.” Move to “woman.” Then move the same distance as when you went from “king” to “man,” but in the opposite direction. What word is closest to the point where you end up? The answer is “queen.”
  • Start at “woman.” Move to “man.” Move to “king.” Then move the same distance as when you went from “woman” to “man,” but in the opposite direction. What word is closest to the point where you end up? The answer is “queen.”
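
Here is the code sketch of that arithmetic, using tiny made-up three-dimensional vectors and assuming only the numpy library; real embeddings are learned from text, have hundreds of dimensions, and are nowhere near this tidy. Both paths above reduce to the same sum: start from “king,” subtract the “man” vector, add the “woman” vector, and find the nearest word.

```python
import numpy as np

# Tiny invented embeddings, purely for illustration.
embeddings = {
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.9, 0.9, 0.1]),
    "king":  np.array([0.9, 0.1, 0.9]),
    "queen": np.array([0.9, 0.9, 0.9]),
    "llama": np.array([0.1, 0.1, 0.2]),
}

# king - man + woman: both paths described above land on this point.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

def cosine(a, b):
    """Similarity of direction between two vectors (1.0 means identical)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the word closest to the target point, excluding the query words.
candidates = {w: v for w, v in embeddings.items() if w not in {"king", "man", "woman"}}
print(max(candidates, key=lambda w: cosine(candidates[w], target)))  # prints: queen
```

With real learned embeddings, the target point almost never lands exactly on a word, which is why the sketch looks for the closest candidate rather than an exact match.
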
Note

The dimensions in the n-dimensional space of word embeddings do not have any inherent meaning. Similarly, the model cannot put any meaningful interpretation on the distance between word coordinates.

A few final words on embeddings. It should come as no surprise that the real world is messier than our simplified example with four words. The word embeddings that an LM learns are based on the text it is trained on, and the resulting encoded relationships will include, sometimes in strange and hard-to-track ways, all the biases and idiosyncrasies of that text (recall that LMs make predictions, not judgments).7 We further discuss these extremely important considerations toward the end of the report.

In addition, many words have multiple meanings, often encompassing multiple parts of speech, such as both verb and noun,8 rendering their vector representations potentially ambiguous and certainly not as clean as in the preceding example. And the dimensions of that embedding space do not have any inherent meaning (remember, LMs work only with structure, not meaning). Word descriptions of this purely mathematical space have the same effect as word descriptions of being a bat—we can discuss hanging upside down, but we’re really talking about a whole different animal.

Road Map of the Rest of the Report

As we’ve mentioned, LMs use three general approaches: (1) manual, based on linguistics; (2) statistical, based on probabilities; and (3) neural, based on embeddings. This report focuses on the last of these. The current state-of-the-art LMs are either entirely composed of neural networks or have a significant neural architecture component. At the same time, neural network–based LMs are far more difficult to get a comfortable sense of, compared to manual or statistical LMs. Our goal for this report is for you to gain this intuition without getting tangled in the mathematics of the inner workings of these models.

In the following chapters, we explain how neural LMs are used to complete a few of the most common and important text tasks, including text summarization, translation, and reading comprehension. LMs seek to emulate the behavior of a human generating language as well as their construction allows. Therefore, throughout this report, we compare how humans and LMs approach these tasks, so that we, as humans, can better understand how machines understand language. The rest of the report is organized as follows:

  • First, we dive deeper into the basic task of an LM, predicting the next word, and use it as a lens to understand neural network language models—specifically, Recurrent Neural Networks and their close relatives.
  • We build on this to consider the task of abstractive text summarization as a lens to understand Encoder-Decoder architecture (also referred to as Sequence-to-Sequence).
  • Then, we discuss machine translation and use it as a lens to understand the attention mechanism and Transformer architecture.
  • Finally, we’ll gather the knowledge gained from the previous sections to explore the current state of machine language understanding, and focus on what LMs are good at as well as on their risks and weaknesses.

1 Thomas Nagel, “What Is It Like to Be a Bat?” The Philosophical Review 83, no. 4 (October 1974): 435-450.

2 Here’s one good guide to NLP systems in a business setting: Practical Natural Language Processing by Sowmya Vajjala et al. (O’Reilly).

3 “Read” is in quotes to emphasize that language models don’t read in the same way humans do.

4 LM inputs and outputs, of course, can include punctuation.

5 For a large language model, many training examples are necessary.

6 For the academic origins of the analogy, see “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov et al. If you’re interested in further academic digging on word analogies, see “Word Embeddings, Analogies, and Machine Learning: Beyond King - Man + Woman = Queen” by Aleksandr Drozd et al.

7 This is especially true when learning on many examples.

8 For just a few examples, words that can be either a verb or a noun include “run,” “walk,” “bat,” and “throw.”
