So far, we have focused solely on numerical time series and devoted most of our time to analysing numbers. Recently, however, the processing of non-numerical data, namely written text, has become an essential part of machine learning and data science. In this chapter, we introduce some concepts from the data processing field that are motivated by how search engines assign importance to documents.
27.1 INFORMATION RETRIEVAL
The first method we review comes from the information retrieval toolbox. The motivation is as follows. We have a set of basic building blocks, words, and a document composed of a subset of these words. We want to assign to each word a measure of how important it is for a given document. Since some words are more frequent than others, the natural frequency of words across the entire corpus of documents has to be taken into account, and the measure has to be corrected for it.
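To make the idea of a frequency-corrected importance measure concrete, the following is a minimal sketch in Python. It assumes one common weighting, term frequency multiplied by the logarithm of the inverse document frequency; the text above does not yet fix a particular formula, so this choice, the whitespace tokeniser, the function name `word_importance` and the toy corpus are all illustrative assumptions rather than the method developed in this chapter.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-casing and whitespace splitting is enough for a toy example.
    return text.lower().split()


def word_importance(corpus, doc_index):
    """Score each word of corpus[doc_index] by its frequency in that document,
    corrected by how many documents of the corpus contain the word."""
    doc_counts = Counter(tokenize(corpus[doc_index]))
    total_words = sum(doc_counts.values())

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))

    n_docs = len(corpus)
    scores = {}
    for word, count in doc_counts.items():
        tf = count / total_words                    # frequency within the document
        idf = math.log(n_docs / doc_freq[word])     # correction for corpus-wide frequency
        scores[word] = tf * idf
    return scores


# Hypothetical three-document corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the quantum cat",
]
scores = word_importance(corpus, 2)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In this toy corpus, "the" appears in every document and therefore receives a score of zero, while "quantum", which appears only in the third document, receives the highest score. Because the scored document is itself part of the corpus, every word has a document frequency of at least one and the ratio is always defined.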
Imagine we have a document, and we want to understand whether its content is relevant for a reader or not. Relevance here means that the document is not just “another” permutation of common words but contains unique words. Let us stress that we base our analysis on the uniqueness of the words themselves and not on how they are structured within a sentence. A novel idea expressed using only common words therefore cannot be captured by the analytics we are building here; capturing it is beyond the aim of this book.
In this section, we first discuss the corpus we use for examples and then two ...