Chapter 11. Word Embeddings

Word embeddings belong to distributional semantics, like the topic models we discussed in the previous chapter. Unlike topic models, however, word embeddings are not built from term-document relationships. Instead, they are learned from smaller contexts, such as sentences or subsequences of tokens within a sentence.

Word embeddings are a rapidly evolving family of techniques. The most popular technique, Word2vec, was developed in 2013 by Tomas Mikolov et al. at Google. Since then, there has been much research (and hype). The idea is to use a neural network to build a language model. Once this model has been learned, you can take some of the intermediate values in the network as representations of the input term.
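Before digging into the details, a quick sketch of what this looks like in practice may help. The example below is our own illustration, not code from this chapter: it uses Spark MLlib's Word2Vec estimator on a toy DataFrame of token sequences, and it assumes an existing SparkSession named spark; the column names and the tiny corpus are likewise hypothetical.

```python
# A minimal sketch of learning word vectors with Spark MLlib's Word2Vec,
# assuming a SparkSession named `spark` already exists.
from pyspark.ml.feature import Word2Vec

# Each row is one "sentence": a small context of tokens.
sentences = spark.createDataFrame([
    ("the quick brown fox jumps over the lazy dog".split(" "),),
    ("the cat sat on the mat".split(" "),),
], ["tokens"])

word2vec = Word2Vec(
    vectorSize=50,   # dimensionality of the learned embeddings
    minCount=1,      # keep every token in this tiny example
    inputCol="tokens",
    outputCol="embedding",
)
model = word2vec.fit(sentences)

# The learned vectors are the intermediate values we keep as word representations.
model.getVectors().show(truncate=False)

# We can also ask which terms have nearby vectors.
model.findSynonyms("cat", 2).show()
```

The vectors returned by getVectors() are exactly the kind of intermediate representation described above: they come from the model's internal weights rather than from any document-level counts.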

In this chapter, we will look at the implementation of Word2vec in code. This will give us a clear understanding of the fundamentals of this family of techniques. We will discuss the more recent approaches at a higher level because they can be quite resource-intensive.

Word2vec

One of the ideas behind deep learning is that the hidden layers are “higher level” representations of the data. This idea comes from analysis of the visual cortex. As information travels from the eye through the brain, successive neurons appear to respond to increasingly complex shapes. The earliest neurons recognize only points of light and dark, later neurons recognize lines and curves, and so on. Under this assumption, if we train a language model using a neural network, its hidden layers should learn higher-level representations of the words, and we can take those intermediate values as our word embeddings.
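To make this concrete, here is a toy sketch, written in plain NumPy and entirely our own illustration rather than the chapter's implementation, of a skip-gram-style network: it learns to predict a context word from a center word, and the rows of the input weight matrix end up serving as the word vectors.

```python
# Toy skip-gram-style training loop (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 10          # vocabulary size, hidden (embedding) size

W_in = rng.normal(scale=0.1, size=(V, H))    # input->hidden weights (the embeddings)
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden->output weights

# (center, context) pairs within a window of 1
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

lr = 0.05
for _ in range(500):
    for center, context in pairs:
        h = W_in[center]                      # hidden layer: just a row lookup
        scores = h @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # softmax over the vocabulary

        grad_scores = probs.copy()
        grad_scores[context] -= 1.0           # gradient of cross-entropy loss
        grad_h = W_out @ grad_scores
        W_out -= lr * np.outer(h, grad_scores)
        W_in[center] -= lr * grad_h

# The row of W_in for a word is its learned embedding.
print(W_in[idx["fox"]])
```

After training, the network's output layer is discarded; only the hidden-layer weights, one row per vocabulary word, are kept as the embeddings.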
