August 2018 · Intermediate to advanced · 438 pages · 12h 3m · English
Generating predictions from our deep learning-based neural image captioning model is not as straightforward as with a basic classification or categorization model. We need to generate a sequence of words from the model, one word at each time-step, conditioned on the input image features. There are multiple ways of generating these sequences of words for the captions.
One approach is known as sampling, or greedy search: we start with the <START> token and the input image features, and generate the first word based on p1 from the LSTM output. We then feed the predicted word's embedding back in as input and generate the next word based on p2 from the next LSTM (in the unrolled form we talked ...
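The greedy-search loop described above can be sketched as follows. This is a minimal, self-contained illustration, not our actual captioning model: the vocabulary is a toy one, and `lstm_step` is a hypothetical stand-in that returns a fake probability distribution where a trained LSTM would return the real p_t over the vocabulary.

```python
import numpy as np

# Hypothetical toy vocabulary; a real model would have thousands of words.
vocab = ["<START>", "<END>", "a", "dog", "running", "on", "grass"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def lstm_step(image_features, prev_word_idx, step):
    """Stand-in for one unrolled LSTM step: returns a probability
    distribution p_t over the vocabulary. A trained model would compute
    this from its learned weights; here we script it for illustration."""
    scripted = ["a", "dog", "running", "on", "grass", "<END>"]
    probs = np.full(len(vocab), 0.01)
    probs[word_to_idx[scripted[min(step, len(scripted) - 1)]]] = 1.0
    return probs / probs.sum()

def greedy_caption(image_features, max_len=20):
    """Greedy search: at each time-step take the argmax word from p_t
    and feed it back in as the input for the next step."""
    caption = []
    word = "<START>"
    for t in range(max_len):
        p = lstm_step(image_features, word_to_idx[word], t)
        word = vocab[int(np.argmax(p))]  # greedy: pick most probable word
        if word == "<END>":              # stop once the end token appears
            break
        caption.append(word)
    return " ".join(caption)

features = np.zeros(2048)  # placeholder for extracted image features
print(greedy_caption(features))  # → "a dog running on grass"
```

Note the design choice this makes visible: greedy search commits to the single most probable word at every step, so an early suboptimal pick can never be revised later, which is exactly the weakness that motivates alternatives such as beam search.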