20 Attention and Transformers

In Chapter 19 we looked at how to use RNNs to handle sequential data. Though powerful, RNNs have a few drawbacks. Because all of the information about an input is represented in a single piece of state memory, or context vector, the networks inside each recurrent cell need to work hard to compress everything that’s needed into the available space. And no matter how large we make the state memory, we can always get an input that exceeds what the memory can hold, so something necessarily gets lost.
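To make this bottleneck concrete, here is a minimal sketch of a vanilla RNN encoder written with NumPy. It is not from the book: the weights are random and untrained, and the names (encode, W_x, W_h) and sizes are made-up assumptions purely for illustration. The point it shows is that no matter how many steps the input has, everything the network knows at the end must fit into the same fixed-size state vector.

    import numpy as np

    # Illustrative sizes only; a real model would learn these weights.
    rng = np.random.default_rng(0)
    input_size, state_size = 8, 16
    W_x = rng.normal(scale=0.1, size=(state_size, input_size))
    W_h = rng.normal(scale=0.1, size=(state_size, state_size))
    b = np.zeros(state_size)

    def encode(sequence):
        """Run a vanilla RNN over a sequence and return its final state memory."""
        h = np.zeros(state_size)              # the context vector starts out empty
        for x in sequence:                    # one step per input element, in order
            h = np.tanh(W_x @ x + W_h @ h + b)
        return h                              # everything must be squeezed in here

    short_input = rng.normal(size=(5, input_size))     # 5 time steps
    long_input = rng.normal(size=(500, input_size))    # 500 time steps

    print(encode(short_input).shape)   # (16,)
    print(encode(long_input).shape)    # (16,)  same size, however long the input

The for loop is also where the second drawback lives: each step needs the state produced by the step before it, so the work over time steps is inherently serial.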

Another problem is that an RNN must be trained and used one word at a time. This can be a slow way to work, particularly for long sequences, because each step has to wait for the previous step's state before it can begin, so the computation can't be spread across time steps in parallel.
