Conceptual approach

A successful image captioning system needs a way to translate a given image into a sequence of words. To extract the right, relevant features from images, we can leverage a deep convolutional neural network (DCNN) and couple it with a recurrent sequence model, such as an RNN or LSTM, to build a hybrid generative model that, given a source image, generates a sequence of words as its caption.
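To make the coupling concrete, here is a minimal sketch of such a hybrid model in Keras. It assumes image features have already been extracted by a pre-trained DCNN (for example, a 2,048-dimensional vector from InceptionV3 or ResNet), and the vocabulary size, caption length, and layer widths are placeholder values, not the book's exact configuration:

```python
# Hybrid captioning model sketch: DCNN features + LSTM language model.
# Assumptions (illustrative only): pre-extracted 2048-d image features,
# a vocabulary of 8,000 tokens, and captions padded to 34 tokens.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 8000        # size of the caption vocabulary (assumed)
max_caption_len = 34     # longest caption, in tokens (assumed)
feature_dim = 2048       # dimensionality of the DCNN image features (assumed)

# Image branch: project the DCNN feature vector into the decoder's space
img_input = Input(shape=(feature_dim,))
img_embed = Dropout(0.5)(img_input)
img_embed = Dense(256, activation='relu')(img_embed)

# Language branch: embed the partial caption and run it through an LSTM
cap_input = Input(shape=(max_caption_len,))
cap_embed = Embedding(vocab_size, 256, mask_zero=True)(cap_input)
cap_embed = Dropout(0.5)(cap_embed)
cap_state = LSTM(256)(cap_embed)

# Merge both branches and predict the next word of the caption
merged = add([img_embed, cap_state])
merged = Dense(256, activation='relu')(merged)
next_word = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[img_input, cap_input], outputs=next_word)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

At generation time, the model is called repeatedly: the caption built so far is fed back into the language branch and the most likely next word is appended, until an end-of-sequence token is produced or the maximum length is reached.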

Thus, conceptually, the idea is to build a single hybrid model that takes a source image I as input and is trained to maximize the likelihood P(S|I), where S is the target output sequence of words, represented as S = {S1, S2, ..., Sn}, and each word Sw comes from a given dictionary, which is our vocabulary. ...
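In practice, this likelihood is usually unrolled word by word with the chain rule, which is what allows the recurrent decoder to be trained with a standard per-word cross-entropy loss. A sketch of that (standard) factorization, using the notation above:

```latex
\log P(S \mid I) \;=\; \sum_{t=1}^{n} \log P\bigl(S_t \mid I, S_1, \ldots, S_{t-1}\bigr)
```

Maximizing the left-hand side over the training pairs (I, S) therefore amounts to teaching the model, at every step t, to assign high probability to the correct next word given the image and the words generated so far.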
