Attention is all we need

We just learned how the seq2seq model works and how it translates a sentence from the source language to the target language. We learned that the context vector is simply the hidden state from the final time step of the encoder; it captures the meaning of the input sentence and is used by the decoder to generate the target sentence.
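As a quick sketch of that recap (using NumPy and made-up shapes, not the book's actual code), the plain seq2seq context vector is nothing more than the encoder's final hidden state:

import numpy as np

# Hypothetical encoder output: one hidden state per source time step
# (made-up shapes: 4 time steps, hidden size 3)
encoder_hidden_states = np.random.randn(4, 3)

# In the plain seq2seq model, the context vector is just the
# hidden state from the final encoder time step
context_vector = encoder_hidden_states[-1]
print(context_vector.shape)  # (3,)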

But when the input sentence is long, the context vector does not capture the meaning of the whole sentence, since it is just the hidden state from the final time step. So, instead of taking only the last hidden state as the context vector and using it for the decoder, we take a weighted sum of all the hidden states from the encoder and use that as the context vector, where the weights tell us how relevant each encoder hidden state is to the word the decoder is currently generating.
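A minimal NumPy sketch of this idea follows, with made-up shapes and a simple dot-product score between each encoder hidden state and the decoder's current hidden state (the actual scoring function varies between attention variants, so treat this as an illustrative assumption):

import numpy as np

def attention_context(encoder_hidden_states, decoder_hidden_state):
    """Context vector as a weighted sum of encoder hidden states."""
    # Score each encoder hidden state against the decoder state
    # (simple dot-product scoring; other scoring functions exist)
    scores = encoder_hidden_states @ decoder_hidden_state   # shape: (T,)
    # Softmax turns the scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                        # shape: (T,)
    # Weighted sum of all encoder hidden states
    return weights @ encoder_hidden_states                   # shape: (hidden,)

# Made-up shapes: 4 source time steps, hidden size 3
encoder_states = np.random.randn(4, 3)
decoder_state = np.random.randn(3)
context = attention_context(encoder_states, decoder_state)
print(context.shape)  # (3,)

Because the weights are recomputed at every decoding step, the decoder can attend to different parts of the source sentence as it generates each target word.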

Let's say the input sentence ...
