Going from RNNs to Transformers

Attention provides a shortcut between source and target sequences. It helps a recurrent model learn from longer sequences by highlighting which portions of the encoded source context map to which portions of the target sequence. The learned attention matrix encodes information about the relationships between tokens in the source and target sequences. But you still need to rely on a recurrent network to extract context from the source sequence.
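The following is a minimal sketch, in Nx, of how such an attention matrix can be computed; the module name, function, and shapes are illustrative assumptions rather than code from this book. Given the hidden states produced for the source sequence and for the target sequence, dot-product scores are normalized into a matrix of weights, and each target position receives a context vector as a weighted sum of source states.

```elixir
# A minimal, illustrative sketch of dot-product attention with Nx.
# `source` and `target` are assumed hidden-state matrices; this is not
# the book's implementation.
defmodule AttentionSketch do
  # source: shape {source_len, hidden}, target: shape {target_len, hidden}
  def attention(source, target) do
    hidden = Nx.axis_size(source, 1)

    # Raw alignment scores: one row per target token, one column per
    # source token, scaled by the square root of the hidden size.
    scores =
      target
      |> Nx.dot(Nx.transpose(source))
      |> Nx.divide(Nx.sqrt(hidden))

    # Softmax over the source positions yields the attention matrix: each
    # row says how strongly a target token attends to each source token.
    maxes = Nx.reduce_max(scores, axes: [-1], keep_axes: true)
    exps = Nx.exp(Nx.subtract(scores, maxes))
    weights = Nx.divide(exps, Nx.sum(exps, axes: [-1], keep_axes: true))

    # Context vectors: attention-weighted sums of the source hidden states.
    context = Nx.dot(weights, source)
    {weights, context}
  end
end

source = Nx.iota({4, 8}, type: :f32)
target = Nx.iota({3, 8}, type: :f32)
{weights, _context} = AttentionSketch.attention(source, target)
# weights has shape {target_len, source_len}; each row sums to 1.0
```

Note that nothing in this sketch is recurrent: the attention weights only re-weight hidden states that some other network has already produced, which is why a recurrent encoder is still doing the work of extracting context.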

Recall from Chapter 9, Understand Text, that it’s common practice to extract the representation of the last token from a sequence transformed by an RNN such as an LSTM, because that final representation encodes information about the entire sequence. The sequential nature of RNNs restricts ...
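As a concrete, purely illustrative sketch of that convention, assume `outputs` stands in for an RNN's output sequence with shape {batch, seq_len, hidden}; the shape and names are assumptions for illustration, not code from the chapter. You keep only the final timestep's representation as a summary of the whole input:

```elixir
# `outputs` stands in for an RNN's output sequence over a batch of inputs.
outputs = Nx.iota({2, 5, 8}, type: :f32)

seq_len = Nx.axis_size(outputs, 1)

# Slice out the final timestep along the sequence axis, then drop that
# axis, leaving one summary vector of shape {batch, hidden} per sequence.
last_state =
  outputs
  |> Nx.slice_along_axis(seq_len - 1, 1, axis: 1)
  |> Nx.squeeze(axes: [1])

# last_state has shape {2, 8}: one summary vector per input sequence.
```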
