How does the network keep track of the previous states? To put this in the context of text generation, think of our training data as a list of character sequences (tokenized words). For each word, starting from its first character, we predict each following character in turn.
Formally, let's denote a sequence of t+1 characters as x = [x_0, x_1, x_2, ..., x_t], and let the initial state be s_{-1} = 0.
For k = 0, 1, ..., t, we construct the following sequence of states and outputs:

s_k = f(s_{k-1}, x_k)
o_k = g(s_k)

where f updates the internal state from the previous state and the current input, and g maps the new state to an output.
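To make the recurrence concrete, here is a minimal NumPy sketch of a single recurrent cell unrolled over a character sequence. The weight names U, W, V, the tanh nonlinearity, the softmax output, and all dimensions are illustrative assumptions, not taken from the text; they stand in for the generic f and g above.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the text):
# 27-character alphabet (a-z plus space), one-hot encoded; hidden state of size 16.
vocab_size, hidden_size = 27, 16
rng = np.random.default_rng(0)

# Hypothetical weights: U maps input -> state, W maps state -> state, V maps state -> output.
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def unroll(xs):
    """Apply s_k = f(s_{k-1}, x_k) and o_k = g(s_k) over a sequence.

    xs: list of one-hot input vectors [x_0, ..., x_t].
    Returns the outputs [o_0, ..., o_t].
    """
    s = np.zeros(hidden_size)            # s_{-1} = 0
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)       # state update from current input and previous state
        outputs.append(softmax(V @ s))   # output: a distribution over the next character
    return outputs

# Usage: encode the word "hello" as one-hot vectors and unroll the cell over it.
def one_hot(c):
    v = np.zeros(vocab_size)
    v[26 if c == " " else ord(c) - ord("a")] = 1.0
    return v

outs = unroll([one_hot(c) for c in "hello"])
print(len(outs), outs[0].shape)  # 5 output distributions, each over vocab_size characters
```

Note that the same weights U, W, and V are reused at every step; only the state s changes as the sequence is consumed, which is how the network keeps track of what it has seen so far.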
This is summarized in the following diagram: when an input x is received, the internal state s of the network is modified and then used to generate an output o: