Going from RNNs to Transformers
Attention provides a shortcut between source and target sequences. It helps a recurrent model learn longer sequences by highlighting which portions of the context vector map to which portions of the target sequence. The learned attention matrix encodes the relationships between tokens in the source and target sequences. However, you still need a recurrent network to extract context from the source sequence.
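To make the idea of an attention matrix concrete, here is a minimal NumPy sketch. It assumes simple dot-product scoring, which is only one of several possible scoring functions and not necessarily the one used elsewhere in this book. The names `dot_product_attention`, `encoder_states`, and `decoder_states` are hypothetical, and the random arrays stand in for hidden states that a recurrent encoder and decoder would normally produce.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(decoder_states, encoder_states):
    # scores[i, j]: similarity between target step i and source step j.
    scores = decoder_states @ encoder_states.T
    # Each row becomes a probability distribution over source tokens.
    weights = softmax(scores, axis=-1)
    # Each target step gets its own weighted mix of the source states.
    context = weights @ encoder_states
    return weights, context

# Stand-in data (assumption): 5 source tokens, 3 target tokens, hidden size 8.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))
decoder_states = rng.normal(size=(3, 8))

weights, context = dot_product_attention(decoder_states, encoder_states)
print(weights.shape)  # one row per target token, one column per source token
print(context.shape)  # one context vector per target token
```

Each row of `weights` sums to 1, so it can be read as how strongly one target token attends to each source token. Note that `encoder_states` still has to come from somewhere; in the setup described here, that is the recurrent network.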
Recall from Chapter 9, Understand Text, that it’s common practice to extract the last token from a sequence transformed by an RNN such as an LSTM, because that token’s representation encodes information about the entire sequence. The sequential nature of RNNs restricts ...
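As a rough illustration of taking the last step of a recurrent network as a summary of the sequence, the sketch below uses a plain tanh RNN cell in NumPy rather than an LSTM, purely to keep it short. The function `run_rnn`, the weight names, and the dimensions are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def run_rnn(inputs, W_x, W_h, b):
    # Start from an all-zero hidden state.
    h = np.zeros(W_h.shape[0])
    hidden_states = []
    # The loop is inherently sequential: step t needs the hidden state from step t - 1.
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)

# Stand-in data (assumption): 6 tokens, 4 input features, hidden size 3.
rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 4, 3
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

hidden_states = run_rnn(inputs, W_x, W_h, b)
# The last hidden state has seen every earlier token, so it serves as a sequence summary.
sequence_summary = hidden_states[-1]
print(hidden_states.shape, sequence_summary.shape)
```

Because each iteration consumes the previous hidden state, the time steps in this loop cannot be computed in parallel.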