The sequence-to-sequence architecture is based on the paper Sequence to Sequence - Video to Text by Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. The paper is available at https://arxiv.org/pdf/1505.00487.pdf.
The following diagram (Figure 5.3) illustrates a sequence-to-sequence video-captioning neural network architecture based on the preceding paper:
The sequence-to-sequence model processes the video image frames through a pre-trained convolutional ...