May 2019
Intermediate to advanced
456 pages
11h 38m
English
A simple yet effective way of classifying text is to treat the text as a bag-of-words. This means that we do not care about the order in which words appear in the text; we only care about which words appear in it.
One way of doing bag-of-words classification is to simply count the occurrences of each word within a text. This is done with a so-called count vector. Each word has an index, and for each text, the value of the count vector at that index is the number of times the word belonging to that index occurs in the text.
As an example, the count vector for the text "I see cats and dogs and elephants" could look like this:
| i | see | cats | and | dogs | elephants |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 1 | 1 |
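
To make the idea concrete, here is a minimal sketch of how such a count vector could be computed in plain Python. The function name `count_vector`, the fixed `vocabulary` list, and the lowercase-and-split-on-whitespace tokenization are assumptions made for illustration; they are not taken from any particular library.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Return how often each vocabulary word occurs in `text`.

    The position of each count matches the position of the word in
    `vocabulary`. Tokenization here is an assumption: lowercase the
    text and split on whitespace.
    """
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never appear in the text.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary: each word's list position is its index.
vocabulary = ["i", "see", "cats", "and", "dogs", "elephants"]

vector = count_vector("I see cats and dogs and elephants", vocabulary)
print(vector)  # [1, 1, 1, 2, 1, 1]

# Word order does not matter in a bag-of-words representation:
shuffled = count_vector("elephants and dogs and cats I see", vocabulary)
print(shuffled == vector)  # True
```

Because only the counts are kept, both sentences map to the same vector, which is exactly the sense in which a bag-of-words ignores word order.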
In reality, count vectors ...