January 2018
Beginner to intermediate
284 pages
8h 35m
English
Bag-of-words (BoW) is used mainly for categorizing documents, and it is also used in computer vision. The idea is to represent a document as a bag (a multiset) of its words, disregarding grammar and word order while keeping track of which words occur and how often.
After the text, often called the corpus, has been preprocessed, a vocabulary is generated from it, and a BoW representation of each document is built on top of that vocabulary.
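As a minimal sketch of the "bag" idea, the snippet below counts words with Python's `collections.Counter`. The helper name `bag_of_words` and the two toy sentences are made up for illustration, and the preprocessing (lowercasing and splitting on whitespace) is an assumed simplification.

```python
from collections import Counter


def bag_of_words(text):
    """Count each word in `text`; word order and grammar are discarded."""
    # Assumed preprocessing: lowercase, then split on whitespace.
    return Counter(text.lower().split())


# Two sentences with the same words in a different order.
a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a == b)    # True: the two bags are identical
print(a["the"])  # 2: repeated words are counted, not collapsed
```

The two sentences mean different things, yet they produce the same bag, which is exactly the information a BoW representation gives up.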
Take the following two text samples as an example:
“The quick brown fox jumps over the lazy dog”
“never jump over the lazy dog quickly”
The corpus (the two text samples) then forms a dictionary in which each key is a word and each value is that word's ID:
{ 'brown': 0, 'dog': 1, 'fox': 2, 'jump': 3, 'jumps': 4, 'lazy': 5, 'never': 6, 'over': 7, 'quick': 8, 'quickly': 9, 'the': ...
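The dictionary and the per-document vectors can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not necessarily the code the book goes on to use: the helper names `tokenize`, `build_vocabulary`, and `to_bow` are hypothetical, the text is lowercased, and IDs are assigned in alphabetical order, which is consistent with the entries visible above. No stemming is applied, so "jump" and "jumps" remain separate words, as they do in the dictionary.

```python
import re

docs = [
    "The quick brown fox jumps over the lazy dog",
    "never jump over the lazy dog quickly",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct word in the corpus to an integer ID."""
    # Assumption: IDs follow alphabetical order of the words.
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: word_id for word_id, word in enumerate(words)}


def to_bow(text, vocab):
    """Return a count vector with one slot per vocabulary word."""
    vector = [0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] += 1
    return vector


vocab = build_vocabulary(docs)
vectors = [to_bow(doc, vocab) for doc in docs]

print(vectors[0])  # [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2]
print(vectors[1])  # [0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]
```

Each vector has one position per word ID, and the value at that position is how many times the word appears in the document. Both vectors have the same length regardless of how long the original sentences are, which is what makes the representation convenient as input to a classifier.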