Computing the softmax is highly computationally expensive. For each training instance, we have to iterate through every word in the vocabulary to compute the softmax. This makes it impractical to scale up to large vocabularies and large training corpora. To solve this problem, fastText uses two approaches: hierarchical softmax and negative sampling. We will discuss hierarchical softmax in this section and negative sampling in the next. In both approaches, the trick is to recognize that we don't need to update all the output vectors per training instance.
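To make the cost concrete, the following is a minimal sketch (not fastText's actual code) of a full softmax output layer. The vocabulary size, embedding dimension, and variable names are illustrative assumptions; the point is that scoring and updating touch all V output vectors for a single training instance:

```python
import numpy as np

V, dim = 50_000, 100                        # hypothetical vocabulary size and embedding dimension
W_out = np.random.randn(V, dim) * 0.01      # one output vector per vocabulary word
h = np.random.randn(dim)                    # hidden/context vector for one training instance

scores = W_out @ h                          # O(V * dim): a score for every word in the vocabulary
probs = np.exp(scores - scores.max())       # softmax normalized over the whole vocabulary
probs /= probs.sum()

# The gradient of the loss depends on every row of W_out, so all V output
# vectors would be updated for this single training instance.
```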
In hierarchical softmax, a binary tree is constructed to represent all the words in the vocabulary. The V words are the leaf nodes of this tree, and each word is reached by a unique path from the root to its leaf.
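The following is a hedged sketch of the idea: instead of scoring all V words, the probability of a word is a product of binary (sigmoid) decisions along its root-to-leaf path, so only about log2(V) node vectors are touched per instance. The tree, node vectors, and path encoding here are illustrative assumptions, not fastText's internal data structures:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim = 100
n_inner = 7                                   # internal nodes of a small toy tree
node_vecs = np.random.randn(n_inner, dim) * 0.01
h = np.random.randn(dim)                      # hidden/context vector

# Path for one word: (internal node index, branch) pairs, where branch is
# +1 for "go left" and -1 for "go right". A binary tree over V leaves gives
# paths of length about log2(V).
path = [(0, +1), (1, -1), (4, +1)]

prob = 1.0
for node, branch in path:
    # Each internal node makes a binary decision via a sigmoid of its
    # node vector dotted with the context vector.
    prob *= sigmoid(branch * (node_vecs[node] @ h))

print(f"P(word | context) ~ {prob:.4f}")
```

Only the node vectors on the word's path are evaluated and updated, which is what makes this approach so much cheaper than the full softmax over V words.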