How it works...
In this recipe, we used TfidfVectorizer() from scikit-learn to extract the TF-IDF values of words present in at least 5% of the documents.
We loaded the 20 Newsgroups text dataset from scikit-learn and then removed punctuation and numbers from the text rows using pandas' replace(), which is accessed through the str accessor of a pandas Series. With it, we replaced digits, matched by '\d+', and symbols, matched by '[^\w\s]', with empty strings, ''. Then, we used TfidfVectorizer() to create TF-IDF statistics for words. We set the lowercase parameter to True to convert words to lowercase before making the calculations. We set the stop_words argument to 'english' to exclude English stop words from the returned matrix. We set ngram_range to the (1,1) tuple to return single words as features. Finally, ...