Backpropagation through time
The other important argument we pass to the iterator is backpropagation through time (BPTT). In practice, it specifies the length of the sequence the model needs to remember. The higher the number, the better, but the model's complexity and the GPU memory it requires also increase.
To understand it better, let's look at how we can split the previously batched alphabet data into sequences of length two:
a g m s
b h n t
The preceding block is passed to the model as input, and the target is drawn from the same sequence but shifted by one step, so it contains the next values:
b h n t
c i o u
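To make the split concrete, here is a minimal sketch in plain Python (not the book's exact code) that walks the batched alphabet data in chunks of bptt_len rows and pairs each chunk with the same rows shifted down by one:

```python
# Batched alphabet data: each column is one contiguous slice of the sequence.
batched = [
    list("agms"),
    list("bhnt"),
    list("ciou"),
    list("djpv"),
    list("ekqw"),
    list("flrx"),
]

bptt_len = 2
for i in range(0, len(batched) - 1, bptt_len):
    # The last chunk may be shorter if the data does not divide evenly.
    seq_len = min(bptt_len, len(batched) - 1 - i)
    inputs = batched[i:i + seq_len]           # rows i .. i+seq_len-1
    targets = batched[i + 1:i + 1 + seq_len]  # same rows shifted down by one
    print(inputs, "->", targets)
```

The first iteration prints the exact pair shown above: the input rows (a g m s / b h n t) against the target rows (b h n t / c i o u).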
For the WikiText2 example, when we split the batched data, we get tensors of size 30 x 20 for each batch, where 30 is the sequence length and 20 is the batch size. ...
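As a hedged sketch of how such 30 x 20 batches might be produced, assuming an older torchtext release where the BPTTIterator API is available (it lives under torchtext.legacy in versions 0.9 to 0.11 and directly under torchtext before that); the field and iterator names here are illustrative:

```python
from torchtext.legacy import data, datasets  # plain `torchtext` in older releases

TEXT = data.Field(lower=True)
train, valid, test = datasets.WikiText2.splits(TEXT)  # download/load the corpus
TEXT.build_vocab(train)

# bptt_len controls how many time steps each chunk spans; batch_size is the
# number of parallel columns, so each batch tensor here is 30 x 20.
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=20, bptt_len=30, device='cpu')

batch = next(iter(train_iter))
print(batch.text.size())    # torch.Size([30, 20]) -> input sequence
print(batch.target.size())  # torch.Size([30, 20]) -> same sequence, shifted by one
```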