We can learn the language model and, implicitly, the embedding function with a feedforward fully connected network. Given a sequence of n-1 words, w_{t-n+1}, ..., w_{t-1}, the network learns to output the probability distribution of the next word, w_t (the following diagram is based on http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf):
The network layers play the following roles:
- The embedding layer takes the one-hot representation of ...
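To make the architecture concrete, here is a minimal sketch of such a feedforward language model, assuming PyTorch; the layer sizes (`vocab_size`, `embed_dim`, `hidden_dim`) and the four-word context are illustrative choices, not values taken from the paper:

```python
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    """A Bengio-style feedforward language model (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_size):
        super().__init__()
        # Embedding layer: maps each word index (a one-hot vector in spirit)
        # to a dense embedding; the weights are shared across all positions
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Hidden layer over the concatenated context embeddings
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        # Output layer: one score per vocabulary word
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):           # context: (batch, n-1) word indices
        emb = self.embedding(context)     # (batch, n-1, embed_dim)
        emb = emb.view(emb.size(0), -1)   # concatenate the n-1 embeddings
        h = torch.tanh(self.hidden(emb))  # nonlinear hidden representation
        logits = self.output(h)           # (batch, vocab_size)
        # log-softmax over the logits gives log P(w_t | w_{t-n+1}, ..., w_{t-1})
        return torch.log_softmax(logits, dim=-1)

# Hypothetical usage: a batch with one 4-word context drawn at random
model = FeedforwardLM(vocab_size=10000, embed_dim=100,
                      hidden_dim=256, context_size=4)
context = torch.randint(0, 10000, (1, 4))
log_probs = model(context)  # log-probabilities over the next word
```

After training with a cross-entropy loss on next-word prediction, the `embedding` weight matrix itself is the learned embedding function: each row is the dense vector for one vocabulary word.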