Chapter 4. Understanding Language on the Cloud

We have fantasized about using natural language to control computers since the original Star Trek TV series aired in the mid-1960s. Characters would often interface with the ship's computer simply by saying “Computer.” Since then, we have struggled to use language as an interface to computers and have instead relied largely on symbolic languages like Python. That began to change only fairly recently, with natural language systems giving us new interfaces like Siri and Alexa.

In this chapter we will discuss language and how advances in deep learning have accelerated our understanding of it. We will first look at natural language processing, or NLP, and talk about why we need it. Then we will move on to the art and science of processing language, from decomposing language into numbers and vectors to processing those vectors with deep learning. After that we will discuss word context with recurrent neural networks (RNNs) and then move on to generating text with RNN layers. We will then use RNNs to build sequence-to-sequence learning, which is at the heart of machine translation. We will end with an advanced example that attempts to understand language itself using transformers.

Here is a high-level overview of the main topics we will cover in this chapter:

  • Natural Language Processing, with Embeddings

  • Recurrent Networks for NLP

  • Neural Translation and the Translation API

  • Natural Language API

  • BERT: Bidirectional Encoder Representations from Transformers

This chapter assumes you have completed the contents of the previous chapters and have worked through a couple of the exercises. The samples in this chapter are intended to be introductory and deal with the broad concept of understanding language. To take these concepts further, you may want to explore more fundamentals of NLP later. In the next section, we look at a new, special type of layer that allows us to understand language context.

Natural Language Processing, with Embeddings

The art and science of natural language processing (NLP) has been around for decades. Over that time it has evolved from a niche, largely rule-based practice into defining, cutting-edge research. However, NLP didn't become widely known and admired until deep learning helped propel it forward.

Deep learning is advancing many areas of technology, but NLP is likely the one that has benefited most. Ten years ago NLP was a tedious and little-understood practice. That changed quickly when people found ways of combining it with deep learning.

Understanding language requires processing huge amounts of data, and deep learning brought about a revolution in NLP precisely because it can handle data at that scale. Deep learning has also introduced new concepts that allow us to extract understanding from language. Before we get to those, though, let's first discuss the basics of processing language.

Understanding One-Hot Encoding

Language, or what we can refer to as text, is used in documents, books, and, of course, speech. We understand language by interpreting sounds and written words, but that won't work for a computer. Instead, we need to translate words or characters into numbers. Many techniques have been used to do this; one of the most common starting points is referred to as one-hot encoding.

One-hot encoding is used in a number of areas of deep learning and data science, so we will describe it generally here. Whenever we have more than two classes, instead of denoting a class with a single numeric value, we represent it as a sparse vector. We use the term sparse because the vector is all zeros except for the one spot that contains the class. Figure 4-1 shows how we can break a block of text down into a one-hot encoded sequence. Each word in the block of text is represented by a 1 in the table, with each row of the table giving the encoding for that word. So the encoding for Bunny would be [1,0,0,0,0,0,0]. Notice that the length of the vector needs to account for every unique word in the text. We call the entire collection of text the corpus, and the set of unique words in it the vocabulary. In Figure 4-1, the vocabulary contains only seven words.

Example of One-hot Encoded Text
Figure 4-1. One-hot encoding explained
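
To make the encoding concrete, here is a minimal sketch in plain Python. The seven-word vocabulary is hypothetical, standing in for the words shown in Figure 4-1:

# Minimal one-hot encoding sketch (hypothetical seven-word vocabulary).
vocab = ["Bunny", "foo", "hopped", "through", "the", "forest", "picking"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A sparse vector: all zeros except a 1 at the word's index.
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("Bunny"))   # [1, 0, 0, 0, 0, 0, 0]
print(one_hot("hopped"))  # [0, 0, 1, 0, 0, 0, 0]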

As for the words and other punctuation in a corpus, even those can, and often need to, be broken down further. For instance, the word hopped from our example is derived from hop. We intuitively know this because of our understanding of language. The word hopped could also be (incorrectly) traced back to hope, which would give our example an entirely different meaning. Needless to say, this process of breaking text down into tokens, or tokenizing it, is complex, and different applications use different approaches. The process of tokenizing words, or what NLP practitioners often call grams (as in n-grams), is outside the scope of this book, but it is essential knowledge and something you should delve into further if you are building any serious NLP system.

One-hot encoding is used heavily on text but also on any data that requires multi-class classification. In the next section, we get back to NLP with word embeddings.

Vocabulary and Bag-of-Words

When we break text down into tokens, we are in essence creating a vocabulary or list of tokens. From this list we can count how frequently each token appears in the document. We can then embed those counts into a document vector called a bag-of-words. Let’s look at a simple example of how this works.

Consider the text:

  • the cat sat on the hat

Our vocabulary and counts for this text may look something like:

  • the - 2

  • cat - 1

  • sat - 1

  • hat - 1

  • on - 1

The bag-of-words vector representing this document would then be:

  • 2,1,1,1,1

Notice that the order is irrelevant. We are only interested in how frequently each word appears. Now, say we want to create another bag-of-words vector for a second document, shown below:

  • the cat sat

The bag-of-words vector for this document would be:

  • 1,1,1,0,0

Notice that the vector contains 0s because those tokens don’t appear in the document. Let’s consider a third document:

  • the cat ran away

When we tokenize this document, we need to add two tokens to the vocabulary. In turn, this means that the bag-of-words for the first two documents needs to change. The bag-of-words for our three documents may look something like:

  • 2,1,1,1,1,0,0

  • 1,1,1,0,0,0,0

  • 1,1,0,0,0,1,1

If you look at the shape of the bag-of-words vectors, you can infer that there may be different meanings in the text. Being able to infer meaning from bag-of-words vectors has been successful in classifying everything from spam to movie reviews.
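
To see the mechanics end to end, here is a minimal sketch in plain Python (simple whitespace tokenization, no real tokenizer) that builds the bag-of-words vectors for our three documents:

# Build bag-of-words vectors for the three example documents.
docs = ["the cat sat on the hat",
        "the cat sat",
        "the cat ran away"]

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab.append(token)

# Count how often each vocabulary token appears in each document.
for doc in docs:
    tokens = doc.split()
    print([tokens.count(word) for word in vocab])
#outputs
# [2, 1, 1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0, 1, 1]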

Note

Bag-of-words vectors are a precursor to another form of encoding called term frequency–inverse document frequency, or TF-IDF. In TF-IDF encoding, we weight how frequently a token appears in a document by the inverse of how frequently it appears across all documents. This helps surface more distinctive tokens, like ran and away in the previous example.
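
To make the idea concrete, here is a minimal TF-IDF sketch over the same three documents. It uses the common log-scaled inverse document frequency; libraries such as scikit-learn apply extra smoothing, so their exact values will differ:

import math

docs = [["the", "cat", "sat", "on", "the", "hat"],
        ["the", "cat", "sat"],
        ["the", "cat", "ran", "away"]]
vocab = ["the", "cat", "sat", "on", "hat", "ran", "away"]

def tf_idf(token, doc, docs):
    tf = doc.count(token) / len(doc)          # term frequency in this document
    df = sum(1 for d in docs if token in d)   # number of documents containing the token
    return tf * math.log(len(docs) / df)      # weight by inverse document frequency

# Tokens unique to the third document (ran, away) stand out;
# tokens that appear everywhere (the, cat) score 0.
print([round(tf_idf(t, docs[2], docs), 3) for t in vocab])
#outputs
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.275, 0.275]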

Word Embeddings

From bag-of-words encodings of text into vectors of numbers, NLP researchers moved on to making sense of and recognizing patterns in those numbers. However, a bag of numbers or words wasn't of much value without some relation or context. What NLP needed to understand was how words or grams relate to other words or grams.

Understanding the similarity or difference between tokens requires us to measure some form of distance. To understand that distance, we need to break our tokens down into vector representations, where the size of the vector represents the number of dimensions we want our words broken into.

Consider our previous example, with the following three documents:

  • the cat sat on the hat

  • the cat sat

  • the cat ran away

With bag-of-words vectors resembling:

  • 2,1,1,1,1,0,0

  • 1,1,1,0,0,0,0

  • 1,1,0,0,0,1,1

We can find the similarity between documents by taking the cosine similarity of any two of the vectors. Cosine similarity is the dot product of two vectors divided by the product of their lengths; cosine distance is simply one minus that value. We can use the SciPy library's spatial module to do the calculation like so:

from scipy import spatial

vector1 = [2,1,1,1,1,0,0]  # bag-of-words for document 1
vector2 = [1,1,0,0,0,1,1]  # bag-of-words for document 3

# spatial.distance.cosine returns the cosine distance,
# so subtract it from 1 to get the similarity.
cosine_similarity = 1 - spatial.distance.cosine(vector1, vector2)
print(cosine_similarity)
OUTPUT
0.5303300858899106

When using cosine similarity, the value returned ranges from –1 to +1 (for nonnegative count vectors like ours, from 0 to 1), measuring how similar the two vectors are. In our code example, we measure the similarity between the first and third example documents. The output of roughly .53 means the documents are somewhat similar but far from identical. You can use this method to determine document similarity using bag-of-words or TF-IDF vectors.

We can also use this method of similarity testing for the tokens themselves. However, this means that we need to learn what those individual token vectors may look like. To do that, we use a special layer type called an embeddings layer.

An embedding layer is a single deep learning layer that learns a weight vector of a chosen number of dimensions for each token; during training, tokens that appear in similar contexts end up with similar vectors. Let's see how this works by running Example 4-1, where we create and visualize embeddings.

Example 4-1. Creating word embeddings
  • Open a new Colab notebook or the example Chapter_4_Embedding.ipynb.

  • We start with the typical imports as shown here:

    from __future__ import (absolute_import, division, print_function,
                            unicode_literals)
    import tensorflow as tf
    
    from tensorflow import keras
    from tensorflow.keras import layers
    import tensorflow_datasets as tfds
  • Next, we will load the data we will use for this example. The dataset we are using here is from the TensorFlow Datasets library, which is a great source to work with. The code to load this dataset is shown here:

    (train_data, test_data), info = tfds.load(
        'imdb_reviews/subwords8k',
        split = (tfds.Split.TRAIN, tfds.Split.TEST),
        with_info=True, as_supervised=True)
  • Running the cell will load the TensorFlow dataset of IMDb reviews. This dataset has already been tokenized to a vocabulary of 8,000 subword tokens. The code also splits the dataset into train and test sets. We use two other options: with_info=True to load the dataset's metadata, and as_supervised=True to return (input, label) pairs for supervised learning.

  • After the dataset is loaded, we can also explore some information about how the data was encoded (tokenized) using the following code:

    encoder = info.features['text'].encoder
    encoder.subwords[2000:2010]
    #outputs
    ['Cha',
     'sco',
     'represent',
     'portrayed_',
     'outs',
     'dri',
     'crap_',
     'Oh',
     'word_',
     'open_']
  • The encoder object we get back here is of type SubwordTextEncoder, an encoder provided with TensorFlow Datasets that breaks text down into grams or tokens. We can use this encoder to see what the vocabulary looks like. The second line extracts and outputs a slice of the tokens/grams from the encoder. Notice how the tokens may represent full or partial words. Later, each of these tokens will be given a learned vector representation based on how and where it appears in the text.

  • We can take a look at what these encoded vectors look like with the following new block of code:

    padded_shapes = ([None],())
    train_batches = train_data.shuffle(1000).padded_batch(10,
    	padded_shapes = padded_shapes)
    test_batches = test_data.shuffle(1000).padded_batch(10,
    	padded_shapes = padded_shapes)
    
    train_batch, train_labels = next(iter(train_batches))
    train_batch.numpy()
    #outputs
    array([[ 133, 1032,    6, ...,    0,    0,    0],
           [  19, 1535,   31, ...,    0,    0,    0],
           [ 750, 2585, 4257, ...,    0,    0,    0],
           ...,
           [  62,   66,    2, ...,   17, 2688, 8029],
           [ 734,   37,  279, ...,    0,    0,    0],
           [  12,  118,  284, ...,    0,    0,    0]])
  • Before we look at the data, we use the top lines to shuffle, pad, and batch the data into train and test batches. We do this so that every document/review in a batch has the same length. Then we extract a batch and display the list of encoded vectors for each review. The value at each position in a list is the index of the word/token at that position in the vocabulary.

Note

There is an important difference here in the encoding vectors. These are not bag-of-words vectors; rather, they are ordered vectors of indexes into a vocabulary. The key difference is that each index represents the word at that position in the text. This is more informative than a bag-of-words encoding because it preserves the order of words, which is often just as important as the words themselves.
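
If you want to see this index encoding for yourself, the SubwordTextEncoder can round-trip a string. Here's a quick sketch (the exact index values depend on the learned vocabulary, so yours will vary):

sample_text = 'The movie was terrific'
encoded = encoder.encode(sample_text)
print(encoded)                   # a list of vocabulary indexes
print(encoder.decode(encoded))   # 'The movie was terrific'

for index in encoded:
    # Each index maps back to a full or partial word in the vocabulary.
    print('{} ----> {}'.format(index, encoder.decode([index])))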

  • With the encoded movie reviews downloaded, we can move on to building our embeddings model, shown here:

    embedding_dim=16
    
    model = keras.Sequential([
      layers.Embedding(encoder.vocab_size, embedding_dim),
      layers.GlobalAveragePooling1D(),
      layers.Dense(16, activation='relu'),
      layers.Dense(1, activation='sigmoid')
    ])
    
    model.summary()
    #outputs
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding_1 (Embedding)      (None, None, 16)          130960
    _________________________________________________________________
    global_average_pooling1d (Gl (None, 16)                0
    _________________________________________________________________
    dense (Dense)                (None, 16)                272
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 17
    =================================================================
    Total params: 131,249
    Trainable params: 131,249
    Non-trainable params: 0
  • This model introduces a couple of new layers. The first, the Embedding layer, trains the weights that determine how similar tokens are across the documents. After this layer, we push things into a GlobalAveragePooling1D layer. This layer is similar to the pooling layers we used with CNNs, but it is one-dimensional, since a sequence of text has only one dimension. From the pooling layer, we then move into a Dense layer of 16 neurons, which finally outputs to a single Dense output layer that uses a sigmoid activation function. Remember, the sigmoid, or squishification, function squishes the output into a range of 0 to 1.

  • Then we move from building to training the model with the following code:

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    history = model.fit(
        train_batches,
        epochs=10,
        validation_data=test_batches, validation_steps=20)
  • As we know, the compile function compiles the model, and the fit function trains it. Notice the choice of optimizer, loss, and metrics in the call to compile. Then in fit, we pass in the training data (train_batches), the number of epochs, the validation data (test_batches), and the number of validation steps. As you run the cell, watch the output and see how the model trains.

  • Next, we can look at the results of training by running the final cell, found in the notebook example. We have seen this code before, so we won’t review it again. The output produces Figure 4-2, and we can see how the model trains and validates.

Example Output from Embedding training
Figure 4-2. Training the embedding model

Now that we have the reviews embedded, what does that mean? Well, embedding (the part learned by that first layer) extracts the similarity between words/tokens/grams in our vocabulary. In the next section, we look at why knowing word similarity matters.

Understanding and Visualizing Embeddings

NLP researchers found that by understanding the relevance of words in documents with respect to other words, we can ascertain the topic or thoughts those documents represent. This is a powerful concept we will explore in more detail in this chapter. Before we do that, however, let’s see how word similarity can be visualized and used for further investigation in Example 4-2.

Example 4-2. Visualize embeddings
  • Embeddings are essentially encoded vectors that represent the similarity between tokens. The representation of that is embedded in a vector of learned weights. We can use those weights to further plot how similar words are using various distance functions like cosine distance. Before that, let’s see how we pull the weights out of the model and write them to a file with the following code:

    e = model.layers[0]
    weights = e.get_weights()[0]
    
    import io
    
    encoder = info.features['text'].encoder
    
    out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
    out_m = io.open('meta.tsv', 'w', encoding='utf-8')
    
    for num, word in enumerate(encoder.subwords):
      vec = weights[num+1] # skip 0, it's padding.
      out_m.write(word + "\n")
      out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_v.close()
    out_m.close()
  • This code extracts the weights from the first layer, layer 0, and then, using the encoder again, enumerates through every word in the vocabulary and outputs the learned weights for that word. We write the words into the meta.tsv file and the corresponding weight vectors into the vecs.tsv file:

    try:
      from google.colab import files
    except ImportError:
       pass
    else:
      files.download('vecs.tsv')
      files.download('meta.tsv')
  • Use this code to download the files we just generated. We need to use those files in an Embeddings viewer or projector. If you encounter issues downloading the files, be sure to follow any prompts. If you have an error with cookies, you need to allow third-party cookies in your browser.

  • After the files are downloaded, open a new browser window to the Embedding Projector page. This page will allow us to upload our saved embeddings and view them.

  • Click the Load button and load the meta.tsv and vecs.tsv files as suggested by the dialog prompt. When you are done uploading, click away from the dialog to close it.

  • Click the Sphereize Data checkbox to view the data more spread out. Type the word confe in the search box.

  • Make sure the default projection method is set to PCA. The distance between vectors determines how similar words are, and there are a number of ways to measure it. Figure 4-3 shows the sphereized data and the search set for the token confe. Notice in the figure how the most similar words are listed on the right side. We can also see in the plot how dissimilar the words are from each other, at least in terms of our document corpus.

Visualizing Word Embeddings
Figure 4-3. Visualizing word embeddings
Note

PCA and t-SNE are dimensionality-reduction techniques used to project high-dimensional word embeddings down to two or three dimensions so they can be plotted. Unlike cosine distance, they don't measure similarity directly; they produce coordinates we can visualize, hence the 3D view.

  • You are encouraged to continue playing with the embeddings projector to understand how words are similar depending on distance calculation or function.

A key element in our ability to train deep learning models on text is understanding how similar or dissimilar words are given a corpus of documents. In our last example, each word/token embedding was represented by 16 dimensions. One way to think about these 16 dimensions is as topics, thoughts, or categories. You may often hear these types of vectors referred to as topic or thought vectors in NLP. The interesting thing to note here is that these topic or thought vectors are learned by the network on its own. By measuring the distance between these vectors, we can determine how similar words are in the context of our corpus (collection of documents).

To represent those topic or thought vectors in space, we can use a variety of visualization techniques. The default technique is principal component analysis, or PCA. PCA takes the 16-dimensional vectors and reduces them to 3D vectors we can visualize in space. The projector also supports t-SNE and UMAP as other ways to visualize this data. You can also visualize the distance between words using a variety of distance measures. The most common is cosine (dot product) distance; the others are Euclidean and taxicab (Manhattan) distance. Play with the embeddings projector from the last exercise to learn more.
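
If you prefer to reduce the vectors in code rather than in the projector, here is a minimal sketch using scikit-learn's PCA on the weights array we extracted in Example 4-2 (it assumes weights is still in memory from that example):

from sklearn.decomposition import PCA

# Reduce the 16-dimensional embedding vectors to 3 dimensions for plotting.
pca = PCA(n_components=3)
reduced = pca.fit_transform(weights)

print(reduced.shape)                   # (vocab_size, 3)
print(reduced[1])                      # 3D coordinates for the first real token (index 0 is padding)
print(pca.explained_variance_ratio_)   # variance captured by each of the 3 components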

Now that we have a basic understanding of NLP and how data is encoded and embedded, we can move on to making sense of the text in the next section.

Recurrent Networks for NLP

Being able to understand the similarity between words in a corpus has extensive applications to searching, sorting, and indexing data. However, our goal has always been to have a machine understand language. In order to do that, we also need to further understand the context of language. While our current NLP model provides us with some context between words, it ignores other more important contexts, such as token order. Consider the following two sentences:

  • The cow jumps over the moon

  • The moon jumps over the cow

There is no difference between the sentences if we just look at the words/vocabulary [cow, jumps, over, the, moon]. If we looked at the bag-of-words or TF-IDF vectors, tokens like moon and cow would likely show similar values. So how do we as humans understand the difference? It all comes down to the order in which we hear or see the words. If we teach this ordering importance to a network, it likely will understand more about the language and text. Fortunately, another form of layer type was developed called recurrent networks, which we will unfold in the next section.

Recurrent Networks for Memory

Recurrent network layers, like CNN layers, are used to extract features, but unlike a CNN, an RNN extracts features from the order, or closeness, of elements in a sequence. Recurrent networks don't convolve over the input; instead, each neuron in a recurrent layer passes its output, or state, along to the step that handles the next element, and it does this for every element fed through the network. Figure 4-4 demonstrates how the network accepts an input and passes state forward. An RNN layer composed of four neurons, for example, would accept an input of up to four elements at a time. You can think of these four inputs as a moving window over the document.

Visualizing Recurrent Networks
Figure 4-4. Recurrent neural networks visualized

In Figure 4-4 we can see how the layer is unrolled to represent the forward and backward passes through the network. Each move of the window over the document inputs the words or tokens at those positions. Since the output from the previous token is fed into the next step, the network learns to associate the order of tokens in the document. It also learns that phrases like “cow jumps” and “moon jumps” can represent entirely different meanings. The unrolled network in Figure 4-4 also shows how the error is backpropagated through the network weights. Let's see what difference adding a recurrent layer to our last example makes for text classification in Example 4-3.
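
Before we do, here is a toy sketch of the core recurrence a simple RNN computes at each step. The sizes and random weights are made up purely for illustration (this is not code from the example notebooks):

import numpy as np

state_size, input_size = 4, 3
Wx = np.random.randn(state_size, input_size)   # weights applied to the current input
Wh = np.random.randn(state_size, state_size)   # weights applied to the previous state
b = np.zeros(state_size)

def rnn_step(x, h_prev):
    # The new state mixes the current token with everything seen before it.
    return np.tanh(Wx @ x + Wh @ h_prev + b)

h = np.zeros(state_size)
for x in np.random.randn(6, input_size):   # a sequence of 6 token vectors
    h = rnn_step(x, h)                     # the state is passed along the sequence
print(h)                                   # the final state summarizes the whole sequence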

Example 4-3. Training sentiment with RNN
  • Open the example Chapter_4_LSTM.ipynb and follow along.

  • This exercise uses several blocks of code from Example 4-1. As such, we will not revisit those here. Instead, we will start by rolling up our data into bigger batches with this code:

    BUFFER_SIZE = 10000
    BATCH_SIZE = 64
    
    train_dataset = train_data.shuffle(BUFFER_SIZE)
    # Chain from train_dataset here so the shuffle isn't discarded.
    train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_data.output_shapes)
    
    test_dataset = test_data.padded_batch(BATCH_SIZE, test_data.output_shapes)
    
    train_batch, train_labels = next(iter(train_dataset))
    train_batch.numpy()
    #outputs
    array([[ 249,    4,  277, ...,    0,    0,    0],
           [2080, 4956,   90, ...,    0,    0,    0],
           [  12,  284,   14, ...,    0,    0,    0],
           ...,
           [ 893, 7029,  302, ...,    0,    0,    0],
           [1646, 1271,    6, ...,    0,    0,    0],
           [ 147, 3219,   34, ...,    0,    0,    0]])
  • Notice the difference here. We shuffle with a buffer of 10,000 examples and group the data into larger batches of 64 reviews each.

  • Now, on to building the model. This time we add a new recurrent layer called an LSTM, for long short-term memory. LSTMs are a specialized form of recurrent layer that are better at preserving longer-term memory and avoiding issues like vanishing gradients. We will look at other types of recurrent layers later in the chapter. The revised code to build the model is shown here:

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(encoder.vocab_size, 64),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    
    model.summary()
    #outputs
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding (Embedding)        (None, None, 64)          523840
    _________________________________________________________________
    bidirectional (Bidirectional (None, 128)               66048
    _________________________________________________________________
    dense (Dense)                (None, 64)                8256
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 65
    =================================================================
    Total params: 598,209
    Trainable params: 598,209
    Non-trainable params: 0
  • The embedding dimension has increased from 16 to 64, and it is these vectors that are fed into a bidirectional LSTM layer. Bidirectional layers allow context/sequences to be learned both forward and backward. Notice how the model summary shows that the number of trainable parameters has increased dramatically. This will in turn greatly increase the time this sample takes to train, so make sure these notebooks are set to use a GPU when training. The code to compile and fit the model, shown here, is quite similar to the last exercise:

    model.compile(loss='binary_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(1e-4),
                  metrics=['accuracy'])
    
    history = model.fit(train_dataset, epochs=10,
                        validation_data=test_dataset,
                        validation_steps=30)
  • Nothing new here. This block of code will take a significant amount of time to run. It will take several hours without a GPU, so grab a beverage and pause for a break, or work ahead and come back later. It’s up to you.

  • After training, we can again run our typical training and validation accuracy scores from the history using the standard plotting code. Figure 4-5 shows the output from this sample’s training.

Training/Validation Accuracy for Sample Chapter_4_LSTM
Figure 4-5. Training/validation accuracy for sample

With the model trained, we can now use it to understand text. The IMDb dataset is classified by sentiment. Our networks have been training to identify the sentiment, either good or bad, of movie reviews. This is the reason our model outputs a single binary class. In the next section, we will see how we can use this model to predict whether text about a movie is positive or negative.

Classifying Movie Reviews

Now that we have a model trained on movie review sentiment, we can use that to test whether our own reviews are read as positive or negative by our machine model. Being able to predict text sentiment, whether good or bad, has broad applications. Any industry that acknowledges and responds to feedback will benefit from auto-processing reviewer feedback using AI. If the model is effectively trained, it can provide useful insights and information even outside sentiment. Consider a model trained to identify not just sentiment but perhaps also quality, style, or other factors. The possibilities are endless, provided you can label your training data effectively.

Note

Google provides a data-labeling service that will label your data per your specifications at a reasonable cost. Labeling a large corpus of text in this manner may be ideal for your needs.

In Example 4-4, we turn our model into a review classifier.

Example 4-4. Classifying sentiment with RNN
  • This exercise continues from the last exercise that trained the model. Refer to sample Chapter_4_LSTM.ipynb and follow along.

  • With the model trained, we can now enter some helper functions that will allow us to predict new text the model has never seen before. The helper code is shown here:

    def pad_to_size(vec, size):
      zeros = [0] * (size - len(vec))
      vec.extend(zeros)
      return vec
    
    def sample_predict(sentence, pad):
      encoded_sample_pred_text = encoder.encode(sentence)
    
      if pad:
        encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
      encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
      predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))
    
      return (predictions)
  • These methods allow us to predict text with or without padding. Text padding can be important in some applications, so it is something we will want to test. The importance of padding document vectors is typically related to the size of documents.

Tip

You will likely notice that the validation accuracy does not keep up with the accuracy of model training. Remember that this is a sign of overfitting, and recall the techniques we can use to fix it. Hint: Dropping out is always an option…

  • Next, we can run the prediction on a new unseen movie review by running the following code:

    sample_pred_text = ('The movie was outrageous. The story was cool but '
                        'graphics needed work. I would recommend this movie.')
    predictions = sample_predict(sample_pred_text, pad=False)
    print (predictions)
  • Feel free to enter your own review text and run the cell to see the model’s prediction.
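
  • Because the model's final layer uses a sigmoid activation, the prediction is a single value between 0 and 1. A simple way to read it (treating 0.5 as the cutoff is our assumption here) is shown in the following sketch:

    # Interpret the sigmoid output; the 0.5 cutoff is an assumed convention.
    score = predictions[0][0]
    print('Positive review' if score > 0.5 else 'Negative review')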

Tip

If you find your Colab sessions keep disconnecting before training finishes, you can use this hack to keep a session connected. Press F12 to open the developer tools in your browser, locate the console, and enter the following JavaScript:

function reconnect(){
   console.log("Reconnecting");
   document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(reconnect,60000)

The setInterval function calls the reconnect function every 60,000 milliseconds, or once a minute. The reconnect function finds the notebook's connect button and clicks it.

You may notice that the sentiment detection is a bit weak or just wrong in some cases. There may be a host of reasons for this, including the size of the LSTM layers, the amount of training, and other factors. In order to improve on this example, we need to look at using more or different recurrent layers.

RNN Variations

There are plenty of variations on recurrent networks, as we began to see in the last example. Each type of recurrent layer has its strengths and weaknesses and may or may not work for your NLP application. The following list summarizes the major types and subtypes of recurrent networks:

Simple RNN

The basic recurrent unit described earlier, and the building block for the other types. It works well for simpler problems and certain edge cases, as we will see.

LSTM (long short-term memory)

A recurrent layer/cell that attempts to solve the vanishing gradient problem by adding gates and an internal cell state that preserve longer-term memory.

GRU (Gated Recurrent Unit)

Addresses the same vanishing gradient problem as the LSTM, but with a simpler design that uses fewer gates and no separate cell state. The GRU tends to perform better than the LSTM on smaller datasets.

Bidirectional RNNs

A subtype of recurrent network that processes a sequence both forward and backward, which is useful for text classification, generation, and understanding.

Nested inputs/outputs

This is another subtype that allows for the definition of custom networks down to the cell level. It’s not something you will likely want to use right away.

Let’s take a look at how we can improve on our last example by changing out the layers on our model and retraining our movie-text sentiment classifier in Example 4-5.

Example 4-5. Going deeper with RNN
  • This sample uses Example 4-4 as the base for all the code. The only part of this example that is different is the construction of the model.

  • Open up the sample Chapter_4_GRU.ipynb and review the code. Scroll down to where the model is built, as the code is shown below:

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(encoder.vocab_size, 64),
        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32,
    		return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(16)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    
    model.summary()
    #outputs
    Model: "sequential_3"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding_3 (Embedding)      (None, None, 64)          523840
    _________________________________________________________________
    bidirectional_3 (Bidirection (None, None, 64)          18624
    _________________________________________________________________
    bidirectional_4 (Bidirection (None, 32)                2592
    _________________________________________________________________
    dense_6 (Dense)              (None, 16)                528
    _________________________________________________________________
    dense_7 (Dense)              (None, 1)                 17
    =================================================================
    Total params: 545,601
    Trainable params: 545,601
    Non-trainable params: 0
  • Notice that the model has fewer parameters and the structure is quite different. We use two recurrent layers, both bidirectional, in this example. The first is a new GRU layer, which is followed by a second, SimpleRNN, layer. The simple layer takes the sequence output from the GRU layer and reduces it so it fits into the first Dense layer. Notice that we also use fewer units in both RNN layers.

  • We can then compile and run the model with the following code:

    model.compile(loss='binary_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(1e-4),
                  metrics=['accuracy'])
    
    history = model.fit(train_dataset, epochs=10,
                        validation_data=test_dataset,
                        validation_steps=30)
  • The model compilation and fit function calls are identical for this example. This example can take several hours to run even if set to GPU mode.

Now that you understand the basics of NLP with embeddings and recurrent networks, we can move on to using the Google API. In the next section, we look at how to use the Translation API to translate text.

Neural Translation and the Translation API

There are a variety of reasons why we would want to understand or otherwise process text. One obvious use of this technology is translation between languages. We don't yet have the universal translator from Star Trek, but deep learning is getting us closer to that goal every day. Machine translation is a subset of NLP that relies on learning sequences with recurrent networks. However, instead of learning from a sequence of review text, we learn transformations from one language to another.

This process of learning transformation from one language or set of text to another is based on sequence-to-sequence learning. In the next section, we introduce sequence-to-sequence learning and look at how to build a simple machine translation model.

Sequence-to-Sequence Learning

We have already learned that RNNs can learn sequences or word association and context. What if we could teach a model or models to encode sequences of one type and decode them as another type? This encoding/decoding model is the basis for something called an autoencoder. Autoencoders can be used to extract features from complex data to generate new data, not unlike a GAN. The embedding layers we looked at earlier are based on an autoencoder architecture. We’ll learn more about autoencoders and GANs in Chapter 7.

For this application, though, we want our model to encode and remember one sequence, or context of tokens. Then we will build a decoder that can learn to transform those sequences to other sequences. There are many applications for this type of learning, and the one we focus on here is for machine translation. This encoder/decoder concept is shown in Figure 4-6. You can see in the diagram how text from one sequence is converted to another sequence.

Sequence-to-Sequence Learning Visualized
Figure 4-6. Sequence-to-sequence learning visualized

In the figure, the text is translated from English (“Hi there, how are you?”) into Spanish (“Hola, cómo estás?”). This example was translated with Google, and it is quite unlikely we could produce similar results building our own model from scratch. As we will see later, though, we can borrow the models Google has developed. Still, it is helpful to review the standard Keras version of sequence-to-sequence learning that follows:

from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Placeholder sizes -- in a real project these come from your tokenized data.
num_encoder_tokens = 1000   # size of the source vocabulary
num_decoder_tokens = 1000   # size of the target vocabulary
latent_dim = 256            # dimensionality of the encoding space

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

That block of code was extracted right from the Keras website's section on sequence-to-sequence learning. It demonstrates the construction of the encoder model and the decoder model, each built from an LSTM layer. These submodels, if you will, are then combined into a larger single model. Keep in mind that this is just one part of the code and doesn't include other steps like tokenization, embedding, and inference.
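
For completeness, here is a rough sketch of how such a model is typically compiled and trained. The arrays encoder_input_data, decoder_input_data, and decoder_target_data are assumed to come from a preprocessing step we don't show, and the hyperparameters are just typical starting values:

# Sketch only: assumes one-hot encoded arrays from a preprocessing step not shown.
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=64,
          epochs=100,
          validation_split=0.2)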

Note

There are plenty of online materials that can help you understand sequence-to-sequence learning in more detail. Be sure to check out those resources if you want to build your own sequencer.

Getting sequence-to-sequence learning to actually process understandably translated text is outside the scope of this book. Instead, in the next section we will look at how to use the Translation API to do the same thing but more effectively and more easily.

Translation API

If anyone can build a universal translator, it is most certainly Google. Being the go-to archivist for much of human media has distinct advantages. Google has been doing machine translation longer than just about any other search or platform company, and it is very much a part of its current business model. Therefore, unless you need to translate some rare language (Klingon is not currently supported), building your own sequence-to-sequence translator isn't practical.

Fortunately, Google provides the AutoML Translation engine for building custom models from your own translation pairs. We will look at that engine shortly. Before that, though, let's look at how to use the Translation API in Example 4-6.

Warning

The next couple of exercises require you to have billing enabled on your Google account. As long as you use these services reasonably, you should incur no charges or only minimal charges. That said, be careful, as experiments can go awry. Be sure to shut down notebooks when you are done with them. We would hate to see you get a huge bill because your Klingon translator got stuck repeating translations, or some other silly bug.

Example 4-6. Translating with the Translation API
  • Open sample Chapter_4_Translate.ipynb and follow along.

  • The first thing we need to do is acquire a developer key to authorize ourselves on the sample. Click the link in the text for the API developer console. From there you will need to create a new API key and authorize it for development.

  • Create a new developer API key or use an existing one.

  • Run the code in the first code block, with the following code:

    import getpass
    
    APIKEY = getpass.getpass()
  • Running that block of code will generate a key pass text box. The box is waiting for you to enter the newly generated developer API key.

  • Copy and paste the key from the developer console into the key pass text box. There won’t be much fanfare, but if there are no errors, you are in.

  • Next, we enter the code that grabs the translation service and translates the text, as shown here:

    from googleapiclient.discovery import build
    service = build('translate', 'v2', developerKey=APIKEY)
    
    # use the service
    inputs = ['hello there, how are you',
              'little rabbit foo foo hopped through the forest',
              'picking up all the field mice']
    outputs = service.translations().list(source='en', target='es',
                                          q=inputs).execute()
    # print outputs
    for input, output in zip(inputs, outputs['translations']):
      print(u"{0} -> {1}".format(input, output['translatedText']))
      #outputs
    hello there, how are you -> Hola cómo estás
    little rabbit foo foo hopped through the forest -> pequeño conejo foo foo
    		esperaba a través del bosque
    picking up all the field mice -> recogiendo todos los ratones de campo
  • This code first takes the developer API key we previously set and authorizes the service. Then it sets up the text we want to translate and does the translation. In this case, we use the Spanish es, but many languages are available. After that, we output the results of the translated text. Yep, so easy.

That service is so easy to use that it begs the question: why would you ever need a different translation method? Well, as it turns out, translation is important for things other than human languages. Machine translation techniques have been used to translate program code, domain-specific terminology, and even recipes. With that in mind, we look at how to use AutoML Translation in the next section.

AutoML Translation

To be complete, we will demonstrate how to use the AutoML Translation engine, which allows you to upload and train on your own language pairings. This API is not a generic sequence-to-sequence learner; it limits you to translating from one recognized language to another. You cannot, for instance, build an English-to-English model, since the sequencer is trained on language pairs. That can limit your ability to do specialized, domain-specific translation, if that is your interest. However, if you just need to do specialized language-to-language translation, AutoML is probably the place for you.

Warning

AutoML services from Google are paid services that charge a premium for use. That means any use of these services will incur a charge.

In Example 4-7, we walk through the workflow for setting up and using the AutoML Translation service to create a new model.

Example 4-7. Using AutoML for translation
  • Point your browser at the AutoML translation page.

  • After the page loads, click on the Datasets menu item on the left side.

  • Click Create Dataset to create a new dataset.

  • A wizard-type dialog will open, directing you to name the dataset and choose your source and translation languages. Again, these languages cannot be the same, unfortunately.

  • After selecting the languages, click Continue to go to the Import tab, as shown in Figure 4-7. From here you can upload tab-separated data pairs. The following is an example of how this data may look as text:

    hello there, how are you\tqavan pa' chay'
    can you understand this language\tlaH Hol Dayaj'a'
    what language is this?\tnuq Hol?
  • You have the option of uploading any language pairings. In the above example, one side of the tab character (\t) is English and the other side is Klingon. The data you upload will be split into training, test, and validation sets, and you need a minimum of 100 items each for test and validation. The default split is 80% training, 10% testing, and 10% validation, which means you will want a minimum of 1,000 unique training pairs.

  • When you have uploaded all your data, click Continue to process it. As the data is processed into training pairs, the service identifies the sentences in each pair.

  • From here you can click on the Train tab to start training the model. Remember that auto machine learning is not only training models but also iteratively searching for optimum model hyperparameters. As such, training times can be substantial.

  • When your model has completed training, you can access predictions from the Predict tab.

You now have enough knowledge to decide on what solution to use, either the Translation API, AutoML Translate, or you can build your own sequence-to-sequence learner. Keep in mind, though, that all these fields are still exploding in research and development, and unless you are one of those researchers, you may want to stick with the Translate API or AutoML solutions. They are much easier to use, and in most cases, they are more practical than building and training your own translation models—that is, unless you need to do some form of custom or unsupported translation. In the next section, we change gears from translation to understanding text.

Importing Datasets into AutoML
Figure 4-7. Importing datasets into AutoML

Natural Language API

The real goal of any AI agent or application is to be able to naturally interface with us humans. Of course, the best way to do that is with natural language understanding, or NLU. While NLP is designed for the bare processing of text, NLU works to understand the concepts behind the text and even read between the lines. However, sarcasm and other emotionally latent responses are still a mystery. Our ability to give a machine understanding of text through deep learning is improving every day.

We already performed sentiment analysis on movie reviews with good success, training a model to determine whether text is positive or negative. Doing this well requires a substantial training corpus, and building and improving a custom model can take extensive effort. Fortunately, Google has already done that work, so we will look at using the Natural Language API in Example 4-8.

Example 4-8. Natural Language for sentiment analysis
  • Open Chapter_4_NL_Sentiment.ipynb and run the first code cell. This will prompt you to enter your developer API key, which is the same key you acquired from the developer console credentials section in the last exercise.

  • The first block of code we need to run, shown here, just sets up the sample:

    from googleapiclient.discovery import build
    lservice = build('language', 'v1beta1', developerKey=APIKEY)
    
    quotes = [
      "I'm afraid that the following syllogism may be used by some in the future. "
      "Turing believes machines think, Turing lies with men Therefore "
      "machines do not think",  # Alan Turing
      "The question of whether a computer can think is no more interesting "
      "than the question of whether a submarine can swim.",  # Edsger W. Dijkstra
      "By far the greatest danger of Artificial Intelligence is that people "
      "conclude too early that they understand it.",  # Eliezer Yudkowsky
      "What use was time to those who'd soon achieve Digital Immortality?",
      # Clyde Dsouza
      "Primary aim of quantum artificial intelligence is to improve human "
      "freedom, dignity, equality, security, and total well-being.",
      # Amit Ray
      "AI winters were not due to imagination traps, but due to lack of "
      "imaginations. Imaginations bring order out of chaos. Deep learning "
      "with deep imagination is the road map to AI springs and AI autumns."
      # Amit Ray
    ]
  • The service is constructed from the call to build, which we import on the first line; we pass in the API name ('language'), the version, and our developer key. Then we create a list of quotes about AI and thinking machines that we will use to analyze sentiment.

  • With setup complete, we’ll move to the next block of code. This block loops through the quotes and sends them to the Natural Language service. The service responds with the analyzed sentiment. After that, the sentiment statistics are output, as shown in the following code:

    for quote in quotes:
      response = lservice.documents().analyzeSentiment(
        body={
          'document': {
             'type': 'PLAIN_TEXT',
             'content': quote
          }
        }).execute()
      print(response)
      polarity = response['documentSentiment']['polarity']
      magnitude = response['documentSentiment']['magnitude']
      print('POLARITY=%s MAGNITUDE=%s for %s' % (polarity, magnitude, quote))
  • Notice in the code that we construct the request body in JSON. JSON, or JavaScript Object Notation, is a format we commonly use to describe objects. If you are unfamiliar with JSON, check out Google for some resources. JSON will be a key element in how we make and receive these requests/responses.

  • You can view the sentiment of each quote by looking at the polarity and magnitude values returned for it. For instance, we can see that the first quote by Turing is negative, with a polarity of –0.2 and magnitude of 0.7, while the last quote by Ray has a positive polarity of 0.2 and magnitude of 1.0. You may get slightly different results.

Analyzing sentiment is just one capability of the Natural Language API. It also supports a wide variety of other tasks, from entity extraction to syntax analysis. The following list outlines each capability:

Sentiment

Sentiment analysis allows us to determine whether the tone of a message is positive or negative.

Entities

Extracting and identifying entities, things, or objects in a sentence is nontrivial. Entity extraction allows you to identify key entities in text and documents. You can then use this information to augment a search, classify documents, or understand a discussion subject.

Syntax

This allows you to inspect the grammar of a sentence. You can determine the voice used and the tense (past, present, or future) for every token. It is not as powerful as Grammarly or the Hemingway app, but it’s a start.

Syntax and entities

This process allows you to extract both sentiment and entities from a document. This one is quite powerful for the right application.

Classify content

Classifying content is similar to entity extraction, with a slight twist. It classifies content based on multiple tokens or words.

Each of these methods is documented more thoroughly on the Google AI Hub for Natural Language API. From here we can move on to another example where we extract entities. Entity analysis and extraction from text has broad applications, from optimizing search to helping a chatbot respond better. This method can be combined with semantic analysis or run in parallel with other analysis types. Imagine being able to identify the semantics, entities, and syntax of any text message or document. We will look at extracting the entities from text in Example 4-9.

Example 4-9. Natural Language for entity analysis
  • Open example Chapter_4_NL_Entities.ipynb. The first code cell again needs to be run to set up security. Be sure to copy your API key from the developer console to your key pass text box and type return.

  • Next, we do the standard import, as seen in the code here:

    from googleapiclient.discovery import build
    lservice = build('language', 'v1beta1', developerKey=APIKEY)
  • Next, we set up the same list of quotes. We won’t show those here. You can also use your own phrases or quotes if you like.

  • The code to run the service and output just basic results is shown here:

    for quote in quotes:
      response = lservice.documents().analyzeEntities(
        body={
          'document': {
             'type': 'PLAIN_TEXT',
             'content': quote
          }
        }).execute()
      print(response)
      #outputs - shortened
      {'entities': [{'name': 'syllogism' ...
      {'entities': [{'name': 'question' ...
      {'entities': [{'name': 'people' ...
      {'entities': [{'name': 'use' ...
      {'entities': [{'name': 'aim' ...
      {'entities': [{'name': 'learning' ...
  • If you look through the entities output, you can see some interesting results. Note that the entry for Turing is also associated with a Wikipedia article about him, possibly pointing to the source of the quote itself.
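
  • To pull out just the interesting pieces, you can iterate over the entities in a response. The following sketch assumes the response layout shown above, where each entity carries a name, a type, and optional metadata that may include a Wikipedia URL:

    # Run this on any response object returned by analyzeEntities.
    for entity in response['entities']:
      name = entity.get('name')
      etype = entity.get('type')
      wiki = entity.get('metadata', {}).get('wikipedia_url', 'n/a')
      print('%s (%s) -> %s' % (name, etype, wiki))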

Natural language analysis used to be done primarily with regular expressions looking for preset rule matches. Companies would build rules engines that allowed them to correlate and qualify their data. With the Natural Language API, regex rules engines are quickly becoming a thing of the past. We will revisit the Natural Language API in future chapters. For now, we move on to exploring one of the top models in NLP and NLU: BERT.

BERT: Bidirectional Encoder Representations from Transformers

Bidirectional Encoder Representations from Transformers, or BERT, represents the state of the art in NLP and NLU. It was considered so powerful that researchers feared models like it could be used to generate fake news, or worse, fake commands or orders. Imagine an AI capable of tweeting believable yet fake content to the world!

Note

We use the term state of the art (SOA) to refer to the latest peak in model research for a given task. The SOA model then becomes the baseline for future models built to handle similar tasks.

The fear of BERT being used for nefarious purposes has subsided, likely because, while the model is good and is closing the gap, it is still far from matching actual human communication. Google and other vendors have since released BERT and its variations to the public as open source.

Through that release, NLP researchers could see how much the landscape of NLP has changed in just a few short years. BERT introduces a number of new concepts to NLP and refines some old ones as well, as we will see. Likewise, the NLP techniques it introduces are now considered state of the art and are worth our focus in this section.

BERT works by applying the concept of bidirectional encoding to an attention model called a Transformer. The Transformer learns contextual relations between words without using RNNs or CNNs, relying instead on an attention-masking technique. Instead of reading text as an ordered sequence the way the models we have seen so far do, BERT consumes the entire text at once. It uses a masking technique called Masked LM (MLM) that randomly masks words in the input and then tries to predict them, which provides a bidirectional view of the context around a word. The following is a summary of the main techniques used by BERT:

Masking (Masked LM)

Masking adds a classification layer on top of the encoded input. This classifier is used to predict the masked words from the words around them, essentially capturing the context of a word/token within the entire input. This removes the need to extract context using an RNN or even a CNN. However, it does require the input to be positionally encoded to preserve word order. In BERT, this masking technique provides bidirectionality; the OpenAI GPT model, which is similar to BERT, uses a related mechanism that is not bidirectional.

Positional encodings

MLM requires positional information to be encoded in a number of ways:

  • By encoding the position of the word in the sentence.

  • By learning not only the pairing of words, but also how they combine to create sentences. This is called segment or sentence embedding.

  • By using a word embeddings layer to encapsulate the importance of the word in the input.

Next sentence prediction

BERT provides the ability to predict large parts or segments of a document using MLM. However, it was also trained on sentence pairings, making it even more capable of predicting the next sentence or segment. This ability opened up further interesting pairings, such as giving BERT an answer and expecting the question. This fed the speculation that BERT could be used to develop believable fake news or other communication for nefarious purposes.
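
To make the masking and positional-encoding ideas above more concrete, here is a minimal, framework-free sketch of MLM-style input preparation. The 15% masking rate and the [MASK] token mirror the published BERT recipe, but the tiny vocabulary and whitespace tokenization are toy stand-ins rather than BERT's actual WordPiece machinery:

    import random

    random.seed(1)  # fixed seed so this tiny example reliably masks at least one token

    # Toy vocabulary; real BERT uses a WordPiece vocabulary of roughly 30,000 entries.
    VOCAB = {'[PAD]': 0, '[MASK]': 1, 'the': 2, 'computer': 3,
             'understands': 4, 'language': 5, 'now': 6}

    def mask_tokens(tokens, mask_rate=0.15):
        """Randomly hide ~15% of tokens behind [MASK]; the model must predict them."""
        ids = [VOCAB[t] for t in tokens]
        labels = [-1] * len(ids)          # -1 marks positions the loss ignores
        for i in range(len(ids)):
            if random.random() < mask_rate:
                labels[i] = ids[i]        # remember the original token id
                ids[i] = VOCAB['[MASK]']  # hide it from the model
        return ids, labels

    tokens = ['the', 'computer', 'understands', 'language', 'now']
    input_ids, labels = mask_tokens(tokens)
    positions = list(range(len(input_ids)))  # positional encoding: each token's index
    segments = [0] * len(input_ids)          # segment embedding: all tokens in sentence A
    print(input_ids, labels, positions, segments)

In the real model, the token, position, and segment ids are each passed through their own embedding layer and summed before entering the Transformer stack.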

With these advances, BERT reformulated NLP almost overnight. The recurrent network layers we covered earlier are now being replaced by attention-based Transformer networks. We will cover the details of this change in Chapter 5, where we take a far more in-depth look at how to build a BERT model. For the rest of this chapter, we will look at how to train a BERT model to do inference on various tasks. In the next section, we'll discuss the first task: building a semantic model with BERT.

Semantic Analysis with BERT

We have already built a semantic analysis network model to classify the sentiment of IMDb-style movie reviews. This model worked relatively well but may have lacked some finesse. For instance, our previous model would likely miss sarcastic movie reviews like “His performance was great, if he was playing a rock.” Sarcasm and other language idioms are difficult to capture semantically. BERT is able to learn these difficult idioms more capably and quickly than recurrent networks. Recurrent networks, which learn sequences by passing state through successive layers, are computationally expensive. BERT, in contrast, uses the bidirectional attention/Transformer mechanism, which is much more performant. That performance increase will become very apparent when we train BERT to learn sentiment.

TF Hub provides a service for BERT, but we won’t use that here. Instead, we will use a module called ktrain. ktrain is a Python module built on top of Keras/TensorFlow that provides a number of helpful trainers for image and text recognition. We will use more of ktrain later for image analysis in video, but for now we will look at using it for text in Example 4-10.

Example 4-10. BERT sentiment analysis
  • Open up Chapter_4_BERT_IMDB.ipynb and follow the code. Since this model is not provided by TF Hub, we can skip the security step in earlier examples.

  • We will start by installing the ktrain module into Colab with the following command:

    !pip3 install ktrain
  • Next, we import ktrain and the text module, as shown in the following code:

    import ktrain
    from ktrain import text
    
    ktrain.__version__
  • ktrain encapsulates all the modules needed for processing and training models, so you only need to import it. This greatly simplifies developing and training complex models, which in turn will greatly ease our development of higher-functioning applications.

  • The bulk of the code we need for this example is for loading the data itself, shown here:

    import tensorflow as tf
    dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz",
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
        extract=True,
    )
    
    # set path to dataset
    import os.path
    #dataset = '/root/.keras/datasets/aclImdb'
    IMDB_DATADIR = os.path.join(os.path.dirname(dataset), 'aclImdb')
    print(IMDB_DATADIR)
  • Running that block of code downloads the IMDb movie review dataset to a local path. ktrain works by being fed the source path of the test and training data, so that’s all we need to do.

  • Next, we load the data and preprocessor. We specify the preprocessing mode (in this case BERT), the train/test folder names, and the classes, as shown in the code here:

    (x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
        IMDB_DATADIR,
        maxlen=500,
        preprocess_mode='bert',
        train_test_names=['train', 'test'],
        classes=['pos', 'neg'])
  • With the data loaded, the model is built and a learner is constructed. Notice that we limit the length of each input to 500 tokens via the maxlen argument. This keeps the model smaller and quicker to train. The learner wraps the model with the training and validation data so it can be fit in the next step, as shown in the following code:

    model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)
    learner = ktrain.get_learner(model,train_data=(x_train, y_train),
                                 val_data=(x_test, y_test), batch_size=6)
    
    learner.lr_find()
    learner.lr_plot()
  • At the end of the block, the learner first finds the appropriate learning rate and then plots out the loss of this search.

  • From there, we need to train the model using a given learning rate and number of epochs. These samples can take a while to run, so be prepared for longer training times. The code to autofit the model is shown here:

    learner.autofit(2e-5, 1)
  • The last block of code will take over an hour to run on a CPU notebook and somewhat less on a GPU notebook. This is still significantly faster than our previous semantic analysis examples, which could take several hours even with a GPU. You may want to use the reconnection hack from the end of Example 4-7 to keep the Colab session from disconnecting.

  • Finally, we review the plot of the training output and then do a sample prediction with the following code:

    learner.lr_plot()
    
    predictor = ktrain.get_predictor(learner.model, preproc)
    
    data = [
      'I am glad the popcorn was good, but my seat was still uncomfortable.',
      'That actor was fantastic, if he were a rock.',
      'Graphics were amazing but the story lacked character.',
      'This movie made me think why do things matter?',
           ]
    
    predictor.predict(data)
    #outputs
    ['neg', 'pos', 'neg', 'pos']
  • Retraining the BERT model for a longer time would likely improve the results. If you want to keep the trained model around instead of refitting it each session, see the short predictor-saving sketch after Figure 4-8.

  • Figure 4-8 shows how well the BERT model is able to adapt to our new task, movie-review sentiment analysis.

Figure 4-8. Output of training loss from BERT model
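
One practical note before moving on: because the fit above can take an hour or more, it is worth persisting the trained predictor rather than refitting it each session. The following is a minimal sketch using ktrain's predictor save/load helpers; the folder path is an arbitrary choice:

    # save the predictor (model weights plus preprocessing) to a folder
    predictor.save('/content/bert_imdb_predictor')

    # later, reload it and predict without retraining
    reloaded = ktrain.load_predictor('/content/bert_imdb_predictor')
    reloaded.predict(['The plot dragged, but the soundtrack was wonderful.'])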

You can see from the output of the last exercise that the model does a surprisingly good job of learning to distinguish language nuances. Keep in mind that the model we are using is a pretrained BERT model to which we are applying transfer learning. The amount of training we did for this example is about the minimum, and you should expect your results to improve with more training. This is the same transfer learning process we used earlier, when we retrained a previously trained model to recognize cats and dogs. To keep things simple, we used ktrain, a nice, compact helper library, to download and set up the BERT model for us. We will continue to use ktrain in the next section to build a document-matching system.

Document Matching with BERT

After looking at sentiment, we can move on to determining document similarity with BERT. We’ve already seen how we could use bag-of-words and TF-IDF vectors to compare document similarity. This allows you to find similar content more easily in things like search or document matching. In Example 4-11, we retrain BERT to classify documents into topics or categories.

Example 4-11. BERT document analysis
  • For this exercise, we are going to retrain BERT to identify similar documents from a set of predefined categories. The dataset we are using, 20newsgroups_dataset, is from the scikit-learn tutorial series.

  • Open the example Chapter_4_BERT_DOCs.ipynb and make sure the runtime type is set to GPU.

  • The first code block defines the basic environment setup and install of ktrain. We won’t review that here, but be sure to take a look at it.

  • Jumping to the next block of code, we can see that the categories are selected. In this example, we use only four categories provided by the sample dataset, which will greatly reduce our training times. From there, we load the data using sklearn, which has become a standard for machine learning exercises like this one. The complete code is shown here; the last lines show how the training and test data are assigned to the various lists:

    categories = ['comp.graphics', 'misc.forsale',
                 'sci.space', 'rec.sport.hockey']
    from sklearn.datasets import fetch_20newsgroups
    train_b = fetch_20newsgroups(subset='train',
       categories=categories, shuffle=True, random_state=42)
    test_b = fetch_20newsgroups(subset='test',
       categories=categories, shuffle=True, random_state=42)
    
    print('size of training set: %s' % (len(train_b['data'])))
    print('size of validation set: %s' % (len(test_b['data'])))
    print('classes: %s' % (train_b.target_names))
    
    x_train = train_b.data
    y_train = train_b.target
    x_test = test_b.data
    y_test = test_b.target
    #outputs
    size of training set: 2362
    size of validation set: 1572
    classes: ['comp.graphics', 'misc.forsale', 'rec.sport.hockey', 'sci.space']
  • With the data loaded, we can move on to building the training and testing datasets with the following code:

    import ktrain
    from ktrain import text
    
    (x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
        x_train=x_train, y_train=y_train,
        x_test=x_test, y_test=y_test,
        class_names=train_b.target_names,
        preprocess_mode='bert',
        ngram_range=1,
        maxlen=350,
        max_features=35000)
  • Then we build the model, construct a learner, and plot the learning-rate search losses with the following code:

    model = text.text_classifier('bert', train_data=(x_train, y_train),
                                 preproc=preproc)
    
    learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                                 batch_size=6)
    
    learner.lr_find()
    learner.lr_plot()
  • The last block of code searches for a suitable learning rate; the next line performs the actual training:

    learner.autofit(2e-5, 5)
  • After the model has been autofit, we move on to extracting a predictor and then confirming our classes with the following code:

    predictor = ktrain.get_predictor(learner.model, preproc)
    predictor.get_classes()
    #outputs
    ['comp.graphics', 'misc.forsale', 'rec.sport.hockey', 'sci.space']
  • With the predictor object, we can then run a prediction of some text of interest using the code here:

    predictor.predict('Deep learning is to science as humor is to laughter.')
  • The output class will depend on how long you trained the model; with more training, the predicted class should make more sense. Keep in mind that we limited the number of topics in our trainer, which in turn limits the document types it can recognize. You can also inspect the model's confidence directly, as shown in the sketch below.
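
  • If you would rather see how confident the model is than just the winning class, ktrain predictors can also return class probabilities. The following is a minimal sketch, assuming the predictor created above; return_proba is ktrain's option for returning probabilities instead of labels:

    probs = predictor.predict(
        'Deep learning is to science as humor is to laughter.',
        return_proba=True)
    print(dict(zip(predictor.get_classes(), probs)))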

Note

You can also switch your Colab notebook to an offline (local) runtime by running Jupyter on your own computer. To switch your runtime connection, click the runtime info dropdown at the top right and select the connection type. The dialog will guide you through the rest of the setup and configuration.

These examples can take a considerable amount of time, so it usually helps to be particularly careful with your setup. Of course, after you fumble your first 12- or 24-hour training session, you begin to learn very quickly. Regardless of our simplistic results, it should be apparent how accessible training NLP models with BERT and ktrain has become. In the next section, we look at a more custom example that illustrates how to source and load data.

BERT for General Text Analysis

We have seen how to use BERT a couple of different ways on prepared datasets. Prepared datasets are excellent for learning, but they’re not practical in the real world. Therefore, in Example 4-12 we look at how to use BERT to do some general text analysis.

Example 4-12. BERT text analysis
  • In this exercise we are going to use jokes as our text analysis source. Jokes and humor can be difficult even for humans to judge, so this will be a great demonstration of BERT. The jokes we will use come from a GitHub repository; we'll be looking at the stupidstuff.org joke list in particular. Note: Some of these jokes should be considered NSFW.

  • Open the example Chapter_4_BERT_Jokes.ipynb. The first cell starts by installing ktrain and wget. We will use wget for pulling down the JSON document of joke text.

  • Then we jump down to downloading and saving the raw JSON joke text using wget with the following code:

    import wget

    jokes_path = "stupidstuff.json"
    url = ('https://raw.githubusercontent.com/taivop/'
           'joke-dataset/master/stupidstuff.json')
    wget.download(url, jokes_path)
  • With the file downloaded, we can open it and parse the JSON into our jokes and humor lists. The humor list holds the class for each joke: 0 if the joke is not funny and 1 if it is. In the code block, we set the fun_limit variable to 4; a joke with a rating above this value is considered funny. The ratings for this dataset go from 1 to 5, so 4 is about 80% of the maximum. The following is the code to load and parse the JSON into the needed data lists:

    import json

    jokes = []       # joke text
    humor = []       # 1 = funny, 0 = not funny
    fun_limit = 4    # ratings above this count as funny
    fun_total = 0
    not_total = 0
    with open(jokes_path) as json_file:
        data = json.load(json_file)
        for d in data:
          joke = d['body']
          jokes.append(joke)
          if d['rating'] > fun_limit:
            humor.append(1)
            fun_total += 1
          else:
            humor.append(0)
            not_total += 1
    print(jokes[244], humor[244])
    print(fun_total, not_total)
  • You can adjust the fun limit to any value you want, but be sure you get a good split of data. The last line in the previous code block outputs the counts of funny and not-funny jokes. Ideally we would want the data split about 50/50; that is often quite difficult to achieve, however, so we will just use a value that doesn't create too large an imbalance. A quick percentage check is sketched below.
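
  • If you want to see the class balance as a percentage rather than raw counts, a quick check using the totals computed above will do (this snippet is a convenience sketch, not part of the original notebook):

    funny_ratio = fun_total / (fun_total + not_total)
    print('funny: {:.1%}, not funny: {:.1%}'.format(funny_ratio, 1 - funny_ratio))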

  • We load the joke text into the X set of inputs, and the Y set of labels is defined by the humor list. We now need to break these sets up into training and test sets using the following code:

    cut = int(len(jokes) * 0.8)
    x_train = jokes[:cut]
    x_test = jokes[cut:]
    y_train = humor[:cut]
    y_test = humor[cut:]
  • The cut value of 0.8 determines the split percentage of our dataset: we use 80% of the data for training and the remaining 20% for testing. Note that this slices the lists in their original order; a shuffled split, sketched below, is often safer.
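
  • As an alternative to the manual slicing above, scikit-learn's train_test_split shuffles the data and can keep the funny/not-funny balance even across both sets. A minimal sketch (the random_state value is an arbitrary choice for reproducibility):

    from sklearn.model_selection import train_test_split

    x_train, x_test, y_train, y_test = train_test_split(
        jokes, humor, test_size=0.2, random_state=42, stratify=humor)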

  • Then we prep the model with the following code:

    import ktrain
    from ktrain import text
    
    (x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
        x_train=x_train, y_train=y_train,
        x_test=x_test, y_test=y_test,
        class_names=['not', 'funny'],
        preprocess_mode='bert',
        ngram_range=1,
        maxlen=500,
        max_features=35000)
  • This is similar to the data-prep code from the previous example. The helper preps our raw text input as defined by the options; notice the changes we make here to ngram_range, maxlen, and max_features. Before training, we also need to build the model and learner, as sketched in the next step.
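
  • Before fitting, the notebook builds the model and learner exactly as in the earlier examples. The following is a minimal sketch of those two calls, assuming the same settings we used before (the notebook's exact batch size may differ):

    model = text.text_classifier('bert', train_data=(x_train, y_train),
                                 preproc=preproc)
    learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                                 val_data=(x_test, y_test), batch_size=6)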

  • Next, we retrain the model with the following single line of code:

    learner.autofit(2e-5, 1)
  • After that we extract the predictor and get the classes again with this code:

    predictor = ktrain.get_predictor(learner.model, preproc)
    predictor.get_classes()
  • With the predictor, we can cast a prediction with the following code:

    predictor.predict("A man walked into a bar. The bartender yelled back, "
                      "get out, we're closed.")
    #outputs
    'not'
  • You can of course try other humorous statements, questions, or whatever and determine if it finds the text funny.

Of course, training the last example for more epochs or on a larger corpus would generally provide much better results. The same repository we used as the source for this example also provides a Reddit dataset of almost 200,000 jokes. Training on that source would likely be more interesting but would certainly require much more time and resources.

While training BERT is more efficient than recurrent network NLP systems, it still requires a large corpus to learn from, since it remains a very large model with an enormous number of parameters. However, with tools like ktrain, we can retrain BERT on more specialized tasks in a relatively short time. This should allow for the development of more robust custom models that tackle real-world problems in the near future.

Conclusion

As a society we have been on a rigorous path toward making our interface with computers more natural, and being able to understand language and text is a big part of that. NLP has started to reshape the world in many ways, and this in turn has shaped our ability to interface with complex systems. We can now read and process the reviews of thousands or millions of users almost instantly, giving businesses the ability to understand how people are using or consuming a commercial service. Google has opened up services that can process text in a number of ways, from translation to simply understanding the entities or syntax of a document. Google further enhanced NLP by open sourcing its controversial BERT model, and pretrained BERT models can now be retrained on everything from sentiment to document similarity, and perhaps even jokes.

In the next chapter we take NLP a step further and introduce you to building chatbots. Chatbots come in all flavors, from assistants like Siri or Alexa to fictional conversational models like the one featured in the movie Her. We will use chatbots as a way to build an interface into a new ML assistant framework.
