The models we have seen in this book so far use a bag-of-words decomposition, enabling us to explore relationships between documents that contain the same mixture of individual words. This is incredibly useful, and indeed we’ve seen that the frequency of tokens can be very effective, particularly in cases where the vocabulary of a specific discipline or topic is sufficient to distinguish it from, or relate it to, other text.
What we haven’t taken into account yet, however, is the context in which the words appear, which we instinctively know plays a huge role in conveying meaning. Consider the following phrases: “she liked the smell of roses” and “she smelled like roses.” Using the text normalization techniques presented in previous chapters, such as stopword removal and lemmatization, these two utterances would have identical bag-of-words vectors even though they have completely different meanings.
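To see this concretely, here is a minimal sketch using a toy stopword list and lemma lookup table (hypothetical stand-ins for a full normalization pipeline, just large enough to handle these two phrases) demonstrating that both utterances reduce to the same bag-of-words vector:

```python
from collections import Counter

# Toy stand-ins for a real stopword list and lemmatizer (assumptions for
# illustration, not a production pipeline).
STOPWORDS = {"she", "the", "of"}
LEMMAS = {"liked": "like", "smelled": "smell", "roses": "rose"}

def bag_of_words(text):
    """Lowercase, drop stopwords, lemmatize, and count the remaining tokens."""
    tokens = [t.lower() for t in text.split()]
    return Counter(LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS)

a = bag_of_words("she liked the smell of roses")
b = bag_of_words("she smelled like roses")
print(a == b)  # → True: identical vectors despite opposite meanings
```

Both phrases normalize to the multiset {like, smell, rose}, so any model operating on these vectors cannot distinguish them.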
This does not mean that bag-of-words models should be discounted entirely; in fact, they usually make very useful initial models. Nonetheless, lower-performing models can often be significantly improved with the addition of contextual feature extraction. One simple yet effective approach is to augment models with grammars, creating templates that target specific types of phrases, which capture more nuance than words alone.
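As a sketch of the grammar-template idea (using a toy one-character tag encoding and a plain regular expression in place of a full chunk parser such as NLTK's RegexpParser, and assuming the tokens have already been part-of-speech tagged with Penn Treebank tags), a template like "adjectives, then nouns, optionally joined to more nouns by a preposition" can be matched directly over the tag sequence:

```python
import re

def keyphrases(tagged):
    """Extract phrases matching the template JJ* NN+ (IN JJ* NN+)? from
    part-of-speech tagged tokens, via a one-character code per tag."""
    code = "".join(
        "N" if tag.startswith("NN") else   # nouns (NN, NNS, NNP, ...)
        "J" if tag.startswith("JJ") else   # adjectives
        "I" if tag == "IN" else            # prepositions
        "O"                                # everything else
        for _, tag in tagged
    )
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"J*N+(?:IJ*N+)?", code)
    ]

# Hypothetical pre-tagged input for the example phrase.
tagged = [("she", "PRP"), ("liked", "VBD"), ("the", "DT"),
          ("smell", "NN"), ("of", "IN"), ("roses", "NNS")]
print(keyphrases(tagged))  # → ['smell of roses']
```

Note that the grammar recovers “smell of roses” as a single unit, capturing exactly the kind of contextual structure that a bag-of-words decomposition discards.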
In this chapter, we will begin by using a grammar to extract key phrases from our documents. ...