Chapter 5. Feature Engineering and Syntactic Similarity

As we saw in Chapter 1, text is significantly different from structured data. One of the most striking differences is that text is represented by words, while structured data (mostly) uses numbers. From a scientific point of view, centuries of mathematical research have led to an extremely good understanding of numbers and to sophisticated methods for working with them. Information science has picked up that mathematical research, and many creative algorithms have been built on top of it. Recent advances in machine learning have generalized many formerly very specific algorithms and made them applicable to a wide range of use cases. These methods “learn” directly from data and provide an unbiased view.

To use these instruments, we have to find a mapping of text to numbers. Considering the richness and complexity of text, it is clear that a single number will not be enough to represent the meaning of a document. Something more complex is needed. The natural extension of real numbers in mathematics is a tuple of real numbers, called a vector. Almost all text representations in text analytics and machine learning use vectors; see Chapter 6 for more.
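As a rough illustration of such a mapping, the following minimal sketch turns each document into a vector of word counts. The helper names `build_vocabulary` and `to_vector`, the whitespace tokenization, and the two sample sentences are assumptions made for this example only; they are not the representation developed later in the book.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return the sorted list of distinct lowercase tokens in all documents."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def to_vector(doc, vocabulary):
    """Map one document to a list of counts, one entry per vocabulary term."""
    counts = Counter(doc.lower().split())
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]


# Hypothetical sample documents, used only for illustration.
docs = ["the cat sat on the mat", "the dog sat on the log"]

vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document ends up with the same number of dimensions (one per vocabulary term), which is what makes the vectors comparable with each other.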

Vectors live in a vector space, and most vector spaces have additional properties such as norms and distances, which will be helpful for us as they imply the concept of similarity. As we will see in subsequent chapters, measuring the similarity between documents is absolutely crucial for most ...
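To make the link between norms, distances, and similarity concrete, here is a minimal sketch that computes the Euclidean norm, the Euclidean distance, and the cosine similarity of two vectors using only the standard library. The function names and the two sample vectors `u` and `v` are hypothetical and chosen for illustration; cosine similarity is shown as one common choice, not necessarily the measure used in every later chapter.

```python
import math


def norm(v):
    """Euclidean (L2) norm: the length of the vector."""
    return math.sqrt(sum(x * x for x in v))


def euclidean_distance(u, v):
    """Distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))


# Hypothetical count vectors for two short documents.
u = [1, 0, 0, 1, 1, 1, 2]
v = [0, 1, 1, 0, 1, 1, 2]

print("norm(u):   ", norm(u))
print("distance:  ", euclidean_distance(u, v))
print("similarity:", cosine_similarity(u, v))
```

A small distance or a cosine close to 1 indicates that two documents have similar vectors; for these sample vectors the distance is 2 and the cosine similarity is approximately 0.75.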
