May 2017
Intermediate to advanced
310 pages
8h 5m
English
The bag of words is a model for representing text data that ignores the order of the words and instead describes each document by how many times each word occurs in it.
Take the following sentences:
sentence_1 = "As fit as a fiddle"
sentence_2 = "As you like it"
The bag of words enables us to decompose text into numerical feature vectors represented by a matrix.
To reduce our two sentences to the bag of words model, we first need a list of all the unique words they contain:
# Join with a space so that "fiddle" and "As" are not fused into "fiddleAs"
set((sentence_1 + " " + sentence_2).split(" "))
This set will become our columns in the matrix. The rows in the matrix will represent the documents that are being used in training. The intersection of a row and column will store the number ...
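The matrix described here can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the book's own implementation: the helper names `tokenize` and `count_vector` are hypothetical, the vocabulary is sorted purely to give the columns a stable order, and words are lowercased so that "As" and "as" share one column (the set expression above is case-sensitive and would keep them apart).

```python
from collections import Counter

sentence_1 = "As fit as a fiddle"
sentence_2 = "As you like it"
documents = [sentence_1, sentence_2]


def tokenize(text):
    # Assumption: lowercase the text and split on whitespace.
    return text.lower().split()


# Columns: every unique word across all documents, in sorted order.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})


def count_vector(text, vocabulary):
    # One row of the matrix: how often each vocabulary word occurs in text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Rows: one count vector per document.
matrix = [count_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so the two sentences become two numerical vectors of equal length. A word that is absent from a document simply gets a count of 0 in that document's row.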