July 2018
Beginner to intermediate
406 pages
9h 55m
English
We do not have to write custom code for counting words and representing those counts as a vector. scikit-learn's CountVectorizer class not only does the job efficiently but also has a very convenient interface:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
The min_df parameter (minimum document frequency) determines how CountVectorizer treats rarely used words. If it is set to an integer, all words occurring in fewer documents than that number will be dropped. If it is a fraction, all words that occur in less than that fraction of the overall dataset will be dropped. The max_df parameter works in a similar manner, but as an upper bound: words that occur in more documents than the threshold are dropped. If we print the instance, we can see what other parameters ...
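To make the thresholding rule concrete, here is a minimal pure-Python sketch of document-frequency filtering as described above. It is not scikit-learn's implementation: the helper name filter_by_document_frequency is hypothetical, and the sketch assumes simple lowercase whitespace tokenization, which differs from CountVectorizer's default tokenizer.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Hypothetical helper: keep words whose document frequency is within bounds.

    Integers are read as absolute document counts, floats as fractions
    of the dataset, mirroring the rule described in the text.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears at least once.
    df = Counter(word for doc in docs for word in set(doc.lower().split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(word for word, count in df.items() if low <= count <= high)


docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
]

vocab_all = filter_by_document_frequency(docs)               # nothing dropped
vocab_min2 = filter_by_document_frequency(docs, min_df=2)    # at least 2 documents
vocab_half = filter_by_document_frequency(docs, min_df=0.5)  # at least half of 4 documents
vocab_max = filter_by_document_frequency(docs, max_df=0.5)   # at most half of 4 documents
```

In this toy dataset, "the" appears in all four documents and "sat" in two, so both min_df=2 and min_df=0.5 keep only those two words, while max_df=0.5 drops only "the".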