A successful image captioning system needs a way to translate a given image into a sequence of words. To extract relevant features from images, we can leverage a deep convolutional neural network (DCNN) and, by coupling it with a recurrent sequence model such as an RNN or LSTM, build a hybrid generative model that generates a sequence of words as a caption for a given source image.
Thus, conceptually, the idea is to build a single hybrid model that takes a source image I as input and is trained to maximize the likelihood P(S|I), where S is the target output sequence of words, represented as S = {S1, S2, ..., Sn}, with each word Sw drawn from a given dictionary, which forms our vocabulary. ...
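Since a caption is generated one word at a time, the likelihood P(S|I) factorizes over time steps via the chain rule: log P(S|I) = Σ_t log P(St | I, S1, ..., St-1), and training maximizes this sum. As a minimal sketch of that objective, the snippet below computes the log-likelihood of a caption from hypothetical per-step word probabilities, which in practice a trained CNN+LSTM decoder would emit at each step (the probability values here are illustrative, not from a real model):

```python
import math

def sequence_log_likelihood(step_probs):
    """Log-likelihood of a caption given an image:
    log P(S|I) = sum over t of log P(S_t | I, S_1, ..., S_{t-1}).

    step_probs: hypothetical per-word probabilities that a trained
    CNN+LSTM decoder would assign to each caption word in turn.
    """
    return sum(math.log(p) for p in step_probs)

# Toy example: a four-word caption whose words the model scores highly.
probs = [0.9, 0.8, 0.7, 0.95]
print(sequence_log_likelihood(probs))
```

Because log is monotonic, maximizing this sum of per-step log-probabilities is equivalent to maximizing P(S|I) itself, and it is numerically far more stable than multiplying many small probabilities together.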