Given an input word, the Skip-gram model predicts its context (the opposite of CBOW). For example, in the sentence The quick brown fox jumps, the input word brown predicts the context words The, quick, fox, and jumps. Unlike CBOW, the input is a single one-hot encoded word. But how do we represent the context words in the output? Instead of trying to predict the whole context (all surrounding words) at once, Skip-gram transforms the context into multiple (input, context) training pairs such as (brown, the), (brown, quick), (brown, fox), and (brown, jumps). Once again, we can train the model with a simple one-layer network:
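A minimal sketch of what such a one-layer network might look like, written in PyTorch purely for illustration (the framework choice, the toy vocabulary, the embedding size, and the training hyperparameters below are assumptions, not taken from the text):

```python
import torch
import torch.nn as nn

# Toy vocabulary covering the example sentence (illustrative only).
words = ["the", "quick", "brown", "fox", "jumps"]
word_to_id = {w: i for i, w in enumerate(words)}
vocab_size, embed_dim = len(words), 16

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Input -> hidden projection; nn.Embedding is equivalent to multiplying
        # a one-hot input vector by the input weight matrix.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Hidden -> output scores, one score per vocabulary word.
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, center_ids):
        return self.out(self.embed(center_ids))  # raw scores (logits)

model = SkipGram(vocab_size, embed_dim)
# CrossEntropyLoss applies the softmax over the vocabulary internally, so each
# training target is simply the index of the context word in the pair.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Training pairs from the example: the center word brown paired with each
# word in its context window.
pairs = [("brown", "the"), ("brown", "quick"), ("brown", "fox"), ("brown", "jumps")]
centers = torch.tensor([word_to_id[c] for c, _ in pairs])
contexts = torch.tensor([word_to_id[t] for _, t in pairs])

for _ in range(100):  # a few passes over the toy pairs
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()
```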
As with CBOW, the output is a softmax over the vocabulary, which represents the one-hot encoded word to be predicted; in Skip-gram this is the context word of each training pair rather than the center word.
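Continuing the illustrative sketch above, applying a softmax to the network's output scores yields a probability distribution over the whole vocabulary, and the target for each pair is the position of that pair's context word:

```python
# Probability distribution over the vocabulary for the center word "brown";
# the values sum to 1, and training moves probability mass toward the words
# that actually appeared in brown's context.
probs = torch.softmax(model(torch.tensor([word_to_id["brown"]])), dim=-1)

# The target for the pair (brown, fox) is conceptually the one-hot vector with
# a 1 at the index of "fox"; in practice only that index is passed to the loss.
target = torch.zeros(vocab_size)
target[word_to_id["fox"]] = 1.0
```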