The following example focuses on building a video question answering model, which we will define using Keras.
To solve this problem, we will then train the model using TensorFlow's high-level training APIs in a distributed setting.
Each video is sampled at 4 frames per second and is roughly 10 seconds long, so we have about 40 frames in total per video. We ask questions about the video contents, like the ones shown in figure 6.
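The frame count above is simple arithmetic, and it fixes the leading dimension of each video input. The sketch below works it out; the frame height, width, and channel count are assumptions made for illustration and are not given in the text.

```python
FRAMES_PER_SECOND = 4   # sampling rate stated in the text
VIDEO_SECONDS = 10      # approximate clip length stated in the text

# Assumed frame dimensions (hypothetical, not from the text).
FRAME_HEIGHT, FRAME_WIDTH, CHANNELS = 150, 150, 3

# Number of frames the model sees per video.
num_frames = FRAMES_PER_SECOND * VIDEO_SECONDS

# Shape of one video when stored as a dense array:
# (frames, height, width, channels).
video_shape = (num_frames, FRAME_HEIGHT, FRAME_WIDTH, CHANNELS)

print(num_frames)
print(video_shape)
```

Because the clips are only roughly 10 seconds long, real videos may yield slightly more or fewer frames, so in practice the frame dimension is often padded or left variable.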
So we are going to build a deep learning model that takes as input: