Segment-level recurrence with state reuse

Transformer-XL introduces a recurrence relationship in the transformer model. During training, the model caches the hidden states it computes for the current segment, and when it processes the next segment, it has access to those cached (but fixed) states, as we can see in the following diagram:

Illustration of the training (a) and evaluation (b) of transformer-XL with an input sequence length of 4. Source: https://arxiv.org/abs/1901.02860
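To make the mechanism concrete, here is a minimal PyTorch sketch of the idea (this is not the book's or the paper's implementation; the class name SegmentRecurrentAttention, the single attention head, and the omission of the causal mask and of Transformer-XL's relative positional encodings are simplifications for illustration). The cached states of the previous segment are detached from the computation graph, so they stay fixed, and are concatenated with the current segment when computing the keys and values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentRecurrentAttention(nn.Module):
    """Self-attention over the current segment plus a cached (fixed) memory."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h: torch.Tensor, mem: torch.Tensor) -> torch.Tensor:
        # h:   hidden states of the current segment, shape (batch, seg_len, d_model)
        # mem: cached hidden states of the previous segment, same shape;
        #      detach() keeps them fixed, so no gradient flows into the old segment
        h_tilde = torch.cat([mem.detach(), h], dim=1)

        # Queries come from the current segment only; keys and values
        # are computed over the cached memory plus the current segment
        q = self.w_q(h)
        k = self.w_k(h_tilde)
        v = self.w_v(h_tilde)

        scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5
        attn = F.softmax(scores, dim=-1)
        return attn @ v


# Toy usage: process two consecutive segments of length 4,
# reusing the first one as the memory for the second
layer = SegmentRecurrentAttention(d_model=16)
segment_1 = torch.randn(1, 4, 16)   # previous segment
segment_2 = torch.randn(1, 4, 16)   # current segment
out = layer(segment_2, mem=segment_1)
print(out.shape)  # torch.Size([1, 4, 16])
```

In the sketch, each output position of the current segment can attend over both the cached and the current states, which is what extends the effective context beyond a single segment.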

During training, the gradient is not propagated through the cached segment. Let's formalize this concept (we'll use the notation from the paper, which might differ slightly from the one we've used in the previous sections):
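Following the paper's notation, let $\mathbf{h}_{\tau}^{n}$ be the sequence of hidden states produced by the $n$-th layer for segment $\tau$. The recurrence for the next segment $\tau+1$ is then:

$$\tilde{\mathbf{h}}_{\tau+1}^{n-1} = \left[\mathrm{SG}\left(\mathbf{h}_{\tau}^{n-1}\right) \circ \mathbf{h}_{\tau+1}^{n-1}\right]$$

$$\mathbf{q}_{\tau+1}^{n},\; \mathbf{k}_{\tau+1}^{n},\; \mathbf{v}_{\tau+1}^{n} = \mathbf{h}_{\tau+1}^{n-1}\mathbf{W}_{q}^{\top},\; \tilde{\mathbf{h}}_{\tau+1}^{n-1}\mathbf{W}_{k}^{\top},\; \tilde{\mathbf{h}}_{\tau+1}^{n-1}\mathbf{W}_{v}^{\top}$$

$$\mathbf{h}_{\tau+1}^{n} = \text{Transformer-Layer}\left(\mathbf{q}_{\tau+1}^{n}, \mathbf{k}_{\tau+1}^{n}, \mathbf{v}_{\tau+1}^{n}\right)$$

Here, $\mathrm{SG}(\cdot)$ is the stop-gradient operator (the cached states are treated as constants), $[\cdot \circ \cdot]$ denotes concatenation along the sequence dimension, and only the keys and values use the extended sequence $\tilde{\mathbf{h}}_{\tau+1}^{n-1}$, while the queries come from the current segment alone.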
