We have certainly not explored the current setup enough and should investigate further. There are roughly two areas where we can turn the knobs: TfidfVectorizer and MultinomialNB. As we have no real intuition about which area is more promising, let's sweep the hyperparameters.
We will look at the TfidfVectorizer parameters first:
- Using different settings for n-grams (ngram_range):
  - unigrams (1,1)
  - unigrams and bigrams (1,2)
  - unigrams, bigrams, and trigrams (1,3)
- Playing with min_df: 1 or 2
- Exploring the impact of IDF within TF-IDF using use_idf and smooth_idf: False or True
- Whether to remove stop words or not, by setting stop_words to "english" or None
- Whether to use the logarithm of the word counts (sublinear_tf)
- Whether ...
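
The TfidfVectorizer settings spelled out above can be written down as a parameter grid. The following is only a minimal sketch: wrapping the two components in a Pipeline, the step names "vect" and "clf", and the use of GridSearchCV with five folds are assumptions made for illustration, and only the options listed explicitly above are included.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Step names "vect" and "clf" are arbitrary labels chosen for this sketch.
pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# One entry per TfidfVectorizer option listed above; the "vect__" prefix
# routes each parameter to the vectorizer step of the pipeline.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__min_df": [1, 2],
    "vect__use_idf": [False, True],
    "vect__smooth_idf": [False, True],
    "vect__stop_words": ["english", None],
    "vect__sublinear_tf": [False, True],
}

# Number of distinct settings the sweep would have to evaluate.
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 3 * 2 * 2 * 2 * 2 * 2 = 96

# Assumed search setup: 5-fold cross-validation over the whole grid.
# Calling grid_search.fit(X, y) on the training texts and labels runs the sweep.
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
```

Even this partial grid already yields 96 combinations, each of which is trained once per fold, so the cost of a sweep grows quickly as further parameters are added.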