July 2018
Beginner to intermediate
406 pages
9h 55m
English
Let's have another look at Post 2. Of its words that are not in the new post, we have most, save, images, and permanently. They are quite different in the overall importance to the post. Words such as most appear very often in all sorts of different contexts and are called stop words. They do not carry as much information and thus should not be weighed as high as words such as images, which don't occur often in different contexts. The best option would be to remove all the words that are so frequent that they do not help us to distinguish between different texts. These words are called stop words.
As this is such a common step in text processing, there is a simple parameter in CountVectorizer to achieve that: ...
Read now
Unlock full access