Bag-of-words
In the bag-of-words representation, every document is represented as a collection of the words present in it, hence the name bag-of-words. The order in which the words occur is not considered in the bag-of-words approach. A good way to organize these bags of words is a matrix representation. Let's say we have a collection of 100 documents (such a collection is also called a corpus). To build the matrix, we first make a list of all the unique words present in those 100 documents. This list is called the vocabulary of our text corpus. Say we have 5,000 unique words. Our matrix then has dimensions 100 x 5,000, where each row is a document and each column is a word from our vocabulary. Such a matrix is called a document-term matrix.
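The construction described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `build_document_term_matrix`, the three-document toy corpus, and the lowercase-and-split-on-whitespace tokenization are all hypothetical choices, not something the text prescribes. Each cell here holds a word count; recording simple presence (0 or 1) would be an equally valid reading of the description.

```python
from collections import Counter


def build_document_term_matrix(documents):
    """Return (vocabulary, matrix) for a list of document strings.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace. Real pipelines usually also handle punctuation,
    stop words, and so on.
    """
    tokenized = [doc.lower().split() for doc in documents]

    # The vocabulary is the sorted list of unique words in the corpus.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    # One row per document, one column per vocabulary word.
    # Word order inside a document is discarded; only counts remain.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus of three documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "cat and dog",
]

vocabulary, dtm = build_document_term_matrix(corpus)

print(vocabulary)
for row in dtm:
    print(row)
```

With this toy corpus the vocabulary has 7 unique words, so the matrix is 3 x 7, in the same way that 100 documents with 5,000 unique words give a 100 x 5,000 matrix.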
Consider ...