The machine learning pipeline for image caption generation
Here we will look at the image caption generation pipeline at a very high level and then discuss it piece by piece until we have the full model. The image caption generation framework consists of three main components and one optional component:
- A CNN generating encoded vectors for images
- An embedding layer learning word vectors
- (Optional) An adaptation layer that can transform a given embedding dimensionality to an arbitrary dimensionality (details will be discussed later)
- An LSTM taking the encoded vectors of the images, and outputting the corresponding caption
First, let's look at the CNN generating the encoded vectors for images. We can achieve this by first training a CNN on a large classification ...