In this section, we'll implement multihead attention by following the definitions from The transformer attention section. We'll start with the implementation of the regular scaled dot product attention:
import math

import torch


def attention(query, key, value, mask=None, dropout=None):
    """Scaled Dot Product Attention"""
    d_k = query.size(-1)

    # 1) and 2) Compute the alignment scores with scaling
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # 3) Compute the attention scores (softmax)
    p_attn = torch.nn.functional.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)

    # 4) Apply the attention scores over the values
    # (also return the attention scores for later inspection)
    return torch.matmul(p_attn, value), p_attn
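As a quick sanity check, here is a minimal usage sketch; the batch size, sequence length, and model dimension below are illustrative assumptions, not values from the book:

# Assumed shapes: batch of 2 sequences, length 5, model dimension 64
query = torch.rand(2, 5, 64)
key = torch.rand(2, 5, 64)
value = torch.rand(2, 5, 64)

output, p_attn = attention(query, key, value)
print(output.shape)  # torch.Size([2, 5, 64])
print(p_attn.shape)  # torch.Size([2, 5, 5])

With the scaled dot product attention in place, the multihead module splits the d_model-dimensional queries, keys, and values into h heads of dimension d_k = d_model / h, runs the attention function above on each head in parallel, and projects the concatenated result back to d_model. The following is a minimal sketch of that standard formulation; the class name, layer layout, and default dropout rate are assumptions rather than necessarily the book's exact code:

class MultiHeadedAttention(torch.nn.Module):
    """Multihead attention built on top of the attention function above.

    A sketch of the standard formulation, assuming d_model is split
    evenly across h heads.
    """

    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_k = d_model // h  # dimension per head
        self.h = h
        # One projection each for query, key, and value,
        # plus the final output projection
        self.linears = torch.nn.ModuleList(
            [torch.nn.Linear(d_model, d_model) for _ in range(4)]
        )
        self.dropout = torch.nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # the same mask is applied to all heads
        batch_size = query.size(0)

        # 1) Project and split into h heads: (batch, h, seq_len, d_k)
        query, key, value = [
            lin(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply scaled dot product attention to all heads in parallel
        x, _ = attention(query, key, value, mask=mask, dropout=self.dropout)

        # 3) Concatenate the heads and apply the output projection
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.linears[-1](x)

Reusing the tensors from the sanity check, MultiHeadedAttention(h=8, d_model=64)(query, key, value) produces an output of shape torch.Size([2, 5, 64]), the same shape as its input, which is what allows attention blocks to be stacked.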