Soft and hard attention were introduced for generating captions for images. A CNN first extracts features from the image and compresses them into an encoding; an LSTM then decodes that encoding into the words of the caption. The details of that pipeline are not important right now, though. What matters is the distinction between soft and hard attention.
In soft attention, the alignment weights learned during training are spread softly over all patches of the image, so the model focuses on some regions more than others while still assigning every patch a nonzero weight.
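A minimal NumPy sketch of this idea, with made-up shapes and random values standing in for the learned quantities: the softmax turns alignment scores into weights over all patches, and the context vector is the weighted sum of patch features. Because every patch contributes, the whole computation is differentiable.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8))  # 4 image patches, each an 8-dim CNN encoding
scores = rng.standard_normal(4)         # alignment scores (learned during training)

weights = softmax(scores)               # soft attention: weights over ALL patches
context = weights @ features            # weighted sum of patch features -> context vector
```

The context vector is what the LSTM decoder would consume at each step; since `context` is a smooth function of `scores`, gradients flow back to the alignment weights directly.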
Hard attention, on the other hand, focuses on only one part of the image at a time. It makes a discrete, all-or-nothing decision about where to focus, and it is much harder to train than soft attention. This is because selecting a single patch is a non-differentiable operation: gradients cannot flow back through the sampling step, so hard attention cannot be trained with ordinary backpropagation and instead relies on sampling-based techniques such as the REINFORCE algorithm.
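For contrast, here is the same toy setup with hard attention. This is an illustrative sketch, not the original authors' implementation: instead of a weighted sum, one patch index is sampled from the attention distribution, and the context is that single patch's features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8))   # 4 image patches, each an 8-dim CNN encoding
weights = softmax(rng.standard_normal(4))  # attention distribution over patches

# Hard attention: sample ONE patch and use only its features as the context.
# The sampling step below is non-differentiable, which is why training
# typically falls back on estimators such as REINFORCE.
idx = rng.choice(len(weights), p=weights)
context = features[idx]
```

Note that `rng.choice` returns an integer index, so `context` is a single row of `features`; no gradient links `context` back to `weights` through this discrete choice.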