Getting ready
The scikit-learn implementation of TF-IDF calculates the IDF statistic in a slightly different way than the standard formula. For details on the exact formula, see the scikit-learn documentation: https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting.
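As a rough illustration of that difference, the following sketch recomputes the IDF by hand and compares it with the values stored by TfidfVectorizer. It assumes the default smooth_idf=True setting, under which the scikit-learn documentation gives the IDF as ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The toy corpus and variable names are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Default settings: smooth_idf=True, norm="l2".
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Document frequency: in how many documents does each term appear?
n_docs = len(docs)
doc_freq = (X.toarray() > 0).sum(axis=0)

# Smoothed IDF as described in the scikit-learn documentation.
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# The hand-computed values should match the fitted idf_ attribute.
print(np.allclose(manual_idf, vectorizer.idf_))
```

The added 1 inside the logarithm acts as if one extra document contained every term, which avoids division by zero, and the trailing + 1 keeps terms that occur in every document from being zeroed out entirely.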
TF-IDF shares the characteristics of BoW when creating the term matrix: a high-dimensional feature space and sparsity. To reduce the number of features and the sparsity, we can remove stop words, convert the text to lowercase, and retain only words that appear in at least a minimum percentage of observations. If you are unfamiliar with these terms, visit the Creating features with bag-of-words and n-grams recipe in this chapter for a recap.
In this recipe, we will learn how to set ...