Chapter 7: Captioning Images with CNNs and RNNs

Equipping neural networks with the ability to describe visual scenes in a human-readable fashion has to be one of the most interesting yet challenging applications of deep learning. The main difficulty arises from the fact that this problem combines two major subfields of artificial intelligence: Computer Vision (CV) and Natural Language Processing (NLP).

The architectures of most image captioning networks use a Convolutional Neural Network (CNN) to encode images in a numeric format so that they're suitable for the consumption of the decoder, which is typically a Recurrent Neural Network (RNN). This is a kind of network specialized in learning from sequential data, such as time series, video, and ...

Get TensorFlow 2.0 Computer Vision Cookbook now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.