Chapter 7, Image Captioning, illustrated several ways to combine text and images. In the same way, captions can be generated for videos to describe their content. The following datasets are available for video captioning:
- Microsoft Research - Video To Text (MSR-VTT) has 200,000 video clip-sentence pairs. More details can be obtained from: https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/.
- MPII Movie Description Corpus (MPII-MD) has 68,000 sentences from 94 movies. It can be obtained from: https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/mpii-movie-description-dataset.
- Montreal ...