General attention
Although we've discussed the attention mechanism in the context of NMT, it is a general deep-learning technique that can be applied to any seq2seq task. Let's assume that we're working with hard attention. In this case, we can think of the vector s_{t-1} as a query executed against a database of key-value pairs, where the keys and the values are both the encoder hidden states h_i. These components are often abbreviated as Q, K, and V, and you can think of them as matrices of vectors. The keys K and the values V of Luong and Bahdanau attention are the same vectors; that is, these attention models are more like Q/V, rather than Q/K/V. The general attention mechanism uses all three components.
The following diagram illustrates this new general attention mechanism:

[Figure: general attention. A query is matched against the keys to produce attention weights, which are then used to compute a weighted sum of the values.]
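As a concrete illustration, here is a minimal NumPy sketch of general attention with separate queries, keys, and values. The dot-product alignment score is just one possible scoring choice, and the function names and shapes are illustrative assumptions rather than anything from the original text:

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def general_attention(Q, K, V):
    # Q: queries, shape (n_queries, d_k)
    # K: keys,    shape (n_items,   d_k)
    # V: values,  shape (n_items,   d_v)
    scores = Q @ K.T                    # alignment scores, (n_queries, n_items)
    weights = softmax(scores, axis=-1)  # attention weights; each row sums to 1
    return weights @ V                  # context vectors, (n_queries, d_v)

# Toy usage: one query (think of the decoder state s_{t-1}) against three
# key-value pairs. In Luong/Bahdanau attention, K and V would both be the
# encoder hidden states h_i; general attention allows them to differ.
rng = np.random.default_rng(0)
Q = rng.standard_normal((1, 4))   # a single query vector
K = rng.standard_normal((3, 4))   # three keys
V = rng.standard_normal((3, 8))   # three values (d_v need not equal d_k)
print(general_attention(Q, K, V).shape)  # (1, 8)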