RNNs can also handle problems where a fixed input must be transformed into a variable-length sequence. Image captioning takes a fixed input picture and outputs a variable-length description of that picture. These models use a CNN to encode the image, and then feed the CNN's output into an RNN, which generates the caption one word at a time:
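The encoder-decoder flow above can be sketched in miniature with plain NumPy: the CNN's feature vector initializes the RNN's hidden state, and the decoder then emits one word per step until it produces an end token. The vocabulary, dimensions, and weights here are hypothetical stand-ins (random and untrained), so the output caption is meaningless; the point is only the wiring, not the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes -- purely illustrative, not from a real model.
VOCAB = ["<start>", "a", "dog", "runs", "<end>"]
EMBED_DIM, HIDDEN_DIM, FEAT_DIM = 8, 16, 32

# Random, untrained weights; a real captioning model learns these end to end.
W_img = rng.normal(scale=0.1, size=(HIDDEN_DIM, FEAT_DIM))   # image features -> initial hidden state
W_emb = rng.normal(scale=0.1, size=(len(VOCAB), EMBED_DIM))  # word embedding table
W_xh  = rng.normal(scale=0.1, size=(HIDDEN_DIM, EMBED_DIM))  # input-to-hidden
W_hh  = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM)) # hidden-to-hidden
W_hy  = rng.normal(scale=0.1, size=(len(VOCAB), HIDDEN_DIM)) # hidden-to-vocabulary scores

def caption(image_features, max_len=10):
    """Greedy decoding: CNN features seed the RNN state; words come out one at a time."""
    h = np.tanh(W_img @ image_features)        # condition the RNN on the image
    word = VOCAB.index("<start>")
    out = []
    for _ in range(max_len):
        x = W_emb[word]                        # embed the previously emitted word
        h = np.tanh(W_xh @ x + W_hh @ h)       # vanilla RNN step
        word = int(np.argmax(W_hy @ h))        # greedy choice of the next word
        if VOCAB[word] == "<end>":
            break
        out.append(VOCAB[word])
    return out

features = rng.normal(size=FEAT_DIM)           # stand-in for a CNN's output vector
print(caption(features))
```

In practice the vanilla RNN step would be an LSTM or GRU, the image features would come from a pretrained CNN such as those we use below, and decoding would use beam search rather than a pure greedy argmax.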
We'll be building a neural captioning model based on the Flickr30k dataset, which you can find in the corresponding GitHub repository for this chapter. In this case, we'll be utilizing pretrained image embeddings ...