July 2018
Beginner to intermediate
406 pages
9h 55m
English
More robust than edit distance is the so-called bag-of-words approach. It ignores the order of words and uses only word counts as its basis. For each word in a post, its occurrences are counted and noted in a vector. Not surprisingly, this step is also called vectorization. The vector is typically huge, as it contains as many elements as there are distinct words in the whole dataset. The two example posts mentioned previously would then have the following word counts:
| Word | Occurrences in post 1 | Occurrences in post 2 |
|---|---|---|
| disk | 1 | 1 |
| format | 1 | 1 |
| how | 1 | 0 |
| hard | 1 | 1 |
| my | 1 | 0 |
| problems | 0 | 1 |
| to | 1 | 0 |
The columns Occurrences in post 1 and Occurrences in post 2 can now be treated as vectors. ...
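As a minimal sketch of the vectorization step, the following uses only the Python standard library. The wording of the two posts is an assumption reconstructed from the word counts in the table, and the helper names `tokenize` and `vectorize` are hypothetical, chosen for illustration. The tokenizer is deliberately naive (lowercasing and splitting on whitespace), and the vocabulary is sorted alphabetically, so the column order differs slightly from the table.

```python
from collections import Counter

# Assumed wording of the two example posts, reconstructed from the table counts.
posts = [
    "How to format my hard disk",
    "Hard disk format problems",
]


def tokenize(text):
    """Lowercase the text and split it on whitespace (a naive tokenizer)."""
    return text.lower().split()


# The vocabulary holds every distinct word in the dataset.
# Sorting gives each word a stable position in the vector.
vocabulary = sorted({word for post in posts for word in tokenize(post)})


def vectorize(text, vocabulary):
    """Return the word-count vector of `text`, one element per vocabulary word."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that do not occur in this post.
    return [counts[word] for word in vocabulary]


vectors = [vectorize(post, vocabulary) for post in posts]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one element per vocabulary word, which is why it grows with the number of distinct words in the dataset rather than with the length of a single post.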