Stop word filtering
A basic strategy for reducing the dimensionality of the feature space is to convert all of the text to lowercase. This is motivated by the insight that letter case does not contribute to the meaning of most words; sandwich and Sandwich have the same meaning in most contexts. Capitalization may indicate that a word begins a sentence, but the bag-of-words model has already discarded all information about word order and grammar.
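The effect of lowercasing on the size of the vocabulary can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy corpus and the vocabulary helper are hypothetical, and splitting on whitespace is a simplifying assumption in place of a real tokenizer.

```python
# Hypothetical toy corpus; whitespace tokenization is a simplifying assumption.
corpus = [
    "Sandwich shops sell a sandwich",
    "The sandwich was good",
]


def vocabulary(documents, lowercase):
    """Collect the set of distinct tokens, optionally lowercasing first."""
    vocab = set()
    for doc in documents:
        text = doc.lower() if lowercase else doc
        vocab.update(text.split())
    return vocab


cased = vocabulary(corpus, lowercase=False)
lowered = vocabulary(corpus, lowercase=True)

# "Sandwich" and "sandwich" collapse into one feature after lowercasing.
print(len(cased), len(lowered))  # 8 7
```

Each distinct token becomes one dimension of the bag-of-words feature vector, so the smaller vocabulary translates directly into a smaller feature space.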
A second strategy is to remove words that are common to most of the documents in the corpus. These words, called stop words, frequently include determiners such as "the", "a", and "an"; auxiliary verbs such as "do", "be", and "will"; and prepositions such as "on", "around", and "beneath". Stop ...
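As a rough sketch of the idea, the stop words named above can be placed in a set and filtered out of a tokenized document. The STOP_WORDS set and the remove_stop_words helper are hypothetical names introduced here for illustration; practical stop lists are much longer, and whitespace tokenization is again a simplifying assumption.

```python
# Illustrative stop list built only from the examples in the text above;
# real stop lists contain many more words.
STOP_WORDS = {"the", "a", "an", "do", "be", "will", "on", "around", "beneath"}


def remove_stop_words(document):
    """Lowercase the document, split on whitespace, and drop stop words."""
    tokens = document.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


filtered = remove_stop_words("The cat will sit on a mat")
print(filtered)  # ['cat', 'sit', 'mat']
```

Lowercasing before the lookup means that "The" and "the" are both matched by the single lowercase entry in the set.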