Negative sampling
From the loss function, we can see that computing the softmax layer can be very expensive. The cross-entropy cost function requires the network to produce probabilities, which means the output score from each neuron needs to be normalized to generate the actual probabilities of each class (for example, a word in the vocabulary in the Skip-Gram model). The normalization requires computation of the hidden-layer output with every word in the context-word matrix. In order to deal with this issue, Word2Vec uses a technique called negative sampling (NEG), which is similar to noise-contrastive estimation (NCE).
The core idea of NCE is to convert a multinomial classification problem, such as the case of predicting a word based on ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access