The machine learning pipeline for image caption generation

Here we will look at the image caption generation pipeline at a very high level and then discuss it piece by piece until we have the full model. The image caption generation framework consists of three main components and one optional component:

  • A CNN generating encoded vectors for images
  • An embedding layer learning word vectors
  • (Optional) An adaptation layer that can transform a given embedding dimensionality to an arbitrary dimensionality (details will be discussed later)
  • An LSTM taking the encoded vectors of the images, and outputting the corresponding caption

First, let's look at the CNN generating the encoded vectors for images. We can achieve this by first training a CNN on a large classification ...

Get Natural Language Processing with TensorFlow now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.