Chapter 15. Transformers for Natural Language Processing and Chatbots
In a landmark 2017 paper titled “Attention Is All You Need”,1 a team of Google researchers proposed a novel neural net architecture named the Transformer, which significantly improved the state of the art in neural machine translation (NMT). In short, the Transformer architecture is simply an encoder-decoder model, very much like the one we built in Chapter 14 for English-to-Spanish translation, and it can be used in exactly the same way (see Figure 15-1):
1. The source text goes into the encoder, which outputs contextualized embeddings (one per token).
2. The encoder’s output is then fed to the decoder, along with the translated text so far (starting with a start-of-sequence token).
3. The decoder predicts the next token for each input token.
4. The last token output by the decoder is appended to the translation.
5. Steps 2 to 4 are repeated again and again to produce the full translation, one extra token at a time, until an end-of-sequence token is generated (see the code sketch after Figure 15-1).

During training, we already have the full translation (it is the target), so it is fed to the decoder in step 2, starting with a start-of-sequence token, and steps 4 and 5 are not needed.
Figure 15-1. Using the Transformer model for English-to-Spanish translation
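To make this inference loop concrete, here is a minimal sketch of greedy decoding in Keras. It assumes the same kind of setup as the Chapter 14 model: a trained `model` that takes a pair of string arrays (the English sentences and the Spanish translations so far) and outputs one probability distribution per decoder position, a target-language TextVectorization layer named `text_vec_layer_es`, and "startofseq"/"endofseq" as the start- and end-of-sequence tokens. All of these names are placeholders for whatever your own model uses.

```python
import numpy as np

def translate(model, text_vec_layer_es, sentence_en, max_length=50):
    """Translate one English sentence with greedy decoding, one token at a time."""
    vocab_es = text_vec_layer_es.get_vocabulary()        # id-to-word lookup table
    translation = ""
    for word_idx in range(max_length):
        X = np.array([sentence_en])                      # step 1: encoder input
        X_dec = np.array(["startofseq " + translation])  # step 2: decoder input so far
        # steps 2-3: run the model and keep the prediction for the last position
        y_proba = model.predict((X, X_dec), verbose=0)[0, word_idx]
        predicted_word_id = np.argmax(y_proba)           # greedy choice of next token
        predicted_word = vocab_es[predicted_word_id]
        if predicted_word == "endofseq":                 # stop at end-of-sequence
            break
        translation += " " + predicted_word              # step 4: append the token
    return translation.strip()
```

Each pass through the loop runs steps 2 to 4: the partial translation is fed back into the decoder, the most likely next token is picked, and it is appended to the translation until the end-of-sequence token appears.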
So what’s new? Well, inside the black box, there are some important differences from our previous ...