Hierarchical softmax
Computing the softmax is expensive because, for each target word, we have to compute the denominator to obtain the normalized probability. The denominator is a sum over every word in the vocabulary, V, of the exponentiated inner product between the hidden layer output vector, h, and that word's output embedding in W, so its cost grows linearly with the vocabulary size.
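To make the cost concrete, here is a minimal sketch of the full softmax in plain Python. The function name softmax_prob, the toy vector h, and the four-row matrix W are illustrative assumptions, not code from the book; W is assumed to hold one output embedding per vocabulary word.

```python
import math

def softmax_prob(h, W, target):
    """Return P(target | h) under a full softmax.

    h      : hidden layer output vector (list of floats)
    W      : output embeddings, one row per vocabulary word (assumed layout)
    target : index of the target word in the vocabulary
    """
    # One inner product per vocabulary word: this loop is the expensive part.
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, row)) for row in W]
    # Subtract the max score for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)  # the normalizing denominator, summed over all |V| words
    return exps[target] / denom

# Toy example: 2-dimensional hidden vector, vocabulary of 4 words.
h = [0.5, -1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
probs = [softmax_prob(h, W, w) for w in range(len(W))]
```

Even to score a single target word, the sketch has to touch every row of W, which is why the cost becomes a bottleneck for vocabularies with hundreds of thousands of words.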
To solve this problem, many different approaches have been proposed. Some are softmax-based approaches, such as hierarchical softmax, differentiated softmax, and CNN softmax, while others are sampling-based approaches. Readers can refer to http://ruder.io/word-embeddings-softmax/index.html#cnnsoftmax for a deeper understanding of approximating softmax functions.
Softmax-based approaches are methods that keep the softmax ...