Transformer-XL
In this section, we'll talk about an improvement over the vanilla transformer, called Transformer-XL, where XL stands for extra long (see Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context at https://arxiv.org/abs/1901.02860). To understand the need to improve the regular transformer, let's discuss some of its limitations, one of which comes from the nature of the transformer itself. An RNN-based model has the (at least theoretical) ability to convey information about sequences of arbitrary length, because its internal state is adjusted based on all previous inputs. But the transformer's self-attention doesn't have such a recurrent component, and it is restricted entirely within the bounds of the current input sequence.
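To make the contrast concrete, here is a minimal sketch of the segment-level recurrence idea behind Transformer-XL, assuming PyTorch. It is a deliberately simplified single-head, single-layer illustration (the paper's relative positional encodings and per-layer memories are omitted, and the function and variable names are our own): hidden states from the previous segment are cached and reused as extra keys and values for the current segment, whereas a vanilla transformer attends only within the current segment (equivalent to an all-zeros memory here).

```python
import torch
import torch.nn.functional as F

def attention_with_memory(h, mem, w_q, w_k, w_v):
    """Single-head self-attention over the current segment h,
    extended with cached hidden states mem from the previous
    segment (segment-level recurrence, simplified)."""
    # Keys and values attend over [memory; current segment];
    # queries come from the current segment only.
    context = torch.cat([mem, h], dim=0)          # (mem_len + seg_len, d)
    q = h @ w_q                                   # (seg_len, d)
    k = context @ w_k                             # (mem_len + seg_len, d)
    v = context @ w_v
    scores = q @ k.t() / k.size(-1) ** 0.5        # scaled dot-product scores
    return F.softmax(scores, dim=-1) @ v          # (seg_len, d)

seg_len, mem_len, d = 4, 4, 8
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

mem = torch.zeros(mem_len, d)                     # no memory for the first segment
for segment in torch.randn(3, seg_len, d):        # a stream of consecutive segments
    out = attention_with_memory(segment, mem, w_q, w_k, w_v)
    # Cache the current hidden states for the next segment; detach()
    # stops gradients from flowing back into older segments, as in the paper.
    mem = segment.detach()
```

Because the memory is carried forward from segment to segment, information can propagate well beyond a single fixed-length input, which is exactly the recurrent ability the vanilla transformer lacks.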