In hierarchical softmax, instead of mapping each output vector to its corresponding word, we arrange the vocabulary as a binary tree: the words sit at the leaves, and the output vectors belong to the inner nodes. Figure 6.34 shows the structure of hierarchical softmax:
Here, the output vector does not predict how probable a word is. Instead, it predicts which way to go at a node of the binary tree: either you follow one branch or you follow the other. Figure 6.35 illustrates this choice.
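To make the branching idea concrete, here is a minimal sketch in plain Python. It assumes a common formulation in which each inner node applies a sigmoid to the dot product of its own vector and the hidden-layer vector to get the probability of going left, and a word's probability is the product of the branch probabilities along its path from the root. The four-word tree, the node names (`n0`, `n1`, `n2`), the vectors, and the helper `word_probability` are all made up for illustration; they are not taken from the figures.

```python
import math


def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy tree over four words (illustrative only):
#
#           n0
#         /    \
#       n1      n2
#      /  \    /  \
#    cat  dog sat  mat
#
# Each word is described by its path from the root:
# a list of (inner node, direction taken at that node).
PATHS = {
    "cat": [("n0", "left"), ("n1", "left")],
    "dog": [("n0", "left"), ("n1", "right")],
    "sat": [("n0", "right"), ("n2", "left")],
    "mat": [("n0", "right"), ("n2", "right")],
}

# Made-up vectors for the inner nodes; in a real model these are learned.
NODE_VECTORS = {
    "n0": [0.5, -0.2],
    "n1": [-0.3, 0.8],
    "n2": [0.1, 0.4],
}


def word_probability(word, hidden):
    """Multiply the branch probabilities along the word's path."""
    prob = 1.0
    for node, direction in PATHS[word]:
        # Probability of taking the left branch at this inner node.
        p_left = sigmoid(dot(NODE_VECTORS[node], hidden))
        prob *= p_left if direction == "left" else 1.0 - p_left
    return prob


# Made-up hidden-layer vector for some input word.
hidden = [1.0, 2.0]
probs = {word: word_probability(word, hidden) for word in PATHS}

for word, p in probs.items():
    print(f"{word}: {p:.4f}")
print("total:", round(sum(probs.values()), 6))
```

Because the two branch probabilities at every inner node add up to 1, the leaf probabilities add up to 1 as well, without ever computing a score for every word in the vocabulary. Each word needs only as many sigmoid evaluations as its path is long, which for a balanced tree grows with the logarithm of the vocabulary size.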