Even though the intra-temporal attention function ensures that different parts of the encoded input are attended at each decoding step, the decoder can still generate repeated phrases when producing long sequences. To prevent this, information about the previously decoded sequence can also be fed into the decoder. Knowing what has already been generated helps the model avoid repeating the same information and leads to more structured predictions.
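As a rough sketch of this idea, the snippet below computes an attention-weighted context vector over the decoder's own previous hidden states; the function name, tensor shapes, and the bilinear scoring matrix `W_dec_attn` are illustrative assumptions rather than the exact formulation used here.

```python
import torch

def intra_decoder_attention(h_t, prev_states, W_dec_attn):
    """Attention over the decoder's own previous hidden states (sketch).

    h_t:         (hidden,)        current decoder hidden state at step t
    prev_states: (t-1, hidden)    decoder hidden states from steps 1..t-1
    W_dec_attn:  (hidden, hidden) learned bilinear scoring matrix (assumed)
    Returns a decoder-side context vector of shape (hidden,).
    """
    # Bilinear score of the current state against each previous decoder state
    scores = prev_states @ (W_dec_attn @ h_t)      # (t-1,)
    # Normalise the scores into an attention distribution over past steps
    alphas = torch.softmax(scores, dim=0)          # (t-1,)
    # Context vector summarising what has already been decoded; it can be
    # combined with h_t before predicting the next token
    return alphas @ prev_states                    # (hidden,)
```

At the first decoding step there are no previous states to attend over, so the decoder-side context vector would typically be set to a zero vector.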
To incorporate this information from previous decoding steps, an intra-decoder attention mechanism is applied; this mechanism is not used in current encoder-decoder models for abstractive summarization. For each time step t while decoding, ...