Computing the softmax is expensive and quickly becomes prohibitive for a large corpus, because we need not only the score for the text with its true label but the scores for that text with every label (the normalization term sums over all of them). So for n texts and m labels, this costs O(nm) score computations, and when the label set grows with the corpus (m on the order of n), it approaches O(n^2), which is clearly not good.
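To make the cost concrete, here is a minimal sketch of flat softmax scoring. The function name and the embedding-based scoring scheme are hypothetical, but the point stands for any scorer: normalizing requires touching all m labels for each of the n texts, giving O(nm) work.

```python
import numpy as np

def flat_softmax_scores(texts, label_vecs):
    """Score every text against every label: O(n * m) dot products.

    texts:      (n, d) array of text embeddings (hypothetical setup)
    label_vecs: (m, d) array of label embeddings (hypothetical setup)
    Returns an (n, m) array of label probabilities per text.
    """
    logits = texts @ label_vecs.T                 # (n, m): one score per (text, label) pair
    logits -= logits.max(axis=1, keepdims=True)   # stabilize exp against overflow
    exp = np.exp(logits)
    # The denominator sums over all m labels -- this is the expensive part.
    return exp / exp.sum(axis=1, keepdims=True)
```

Even if we only care about one label's probability, the denominator forces us to compute all m scores.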
Softmax and similar methods for producing probabilities also ignore any semantically meaningful organization of the classes. A classifier should incur a higher penalty for classifying a dog as a submarine than for classifying a dog as a wolf. The intuition is that the target labels are probably not flat but organized as a tree.
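A tree over the labels also fixes the cost problem from above: if each label is a leaf of a binary tree, its probability is the product of binary (sigmoid) decisions along the root-to-leaf path, so scoring one label costs O(depth) = O(log m) instead of O(m). A minimal sketch, assuming each path is given as a list of (logit, go_left) decisions (the representation here is hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probability(path):
    """Probability of reaching one leaf label in a binary label tree.

    path: list of (logit, go_left) pairs from the root to the leaf,
          where logit is the score of the internal node's binary decision.
    The product of sigmoids along the path replaces the full softmax
    normalization, so the cost is O(depth), not O(number of labels).
    """
    p = 1.0
    for logit, go_left in path:
        s = sigmoid(logit)          # P(go left at this node)
        p *= s if go_left else (1.0 - s)
    return p
```

By construction, the leaf probabilities of any such tree sum to 1, since each internal node splits its mass between its two children.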
So now we have k classes that ...