December 2018
Beginner to intermediate
684 pages
21h 9m
English
The notebook contains an interactive visualization that explores the impact of the min_df and max_df settings on the size of the vocabulary. We read the articles into a DataFrame, set the CountVectorizer to produce binary flags and use all tokens, and call its .fit_transform() method to produce a document-term matrix:
binary_vectorizer = CountVectorizer(max_df=1.0, min_df=1, binary=True)
binary_dtm = binary_vectorizer.fit_transform(docs.body)
<2225x29275 sparse matrix of type '<class 'numpy.int64'>' with 445870 stored elements in Compressed Sparse Row format>
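To make the effect of min_df and max_df on vocabulary size concrete without the article data, here is a minimal sketch on a toy corpus. The names toy_docs and vocab_stats are hypothetical and are not part of the notebook; the sketch assumes only the CountVectorizer defaults (lowercasing, tokens of two or more characters, no stop-word removal).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the article bodies (docs.body)
toy_docs = [
    "stocks rise as markets rally",
    "markets fall as the stocks slide",
    "team wins the cup final",
    "the team loses the final",
]


def vocab_stats(min_df=1, max_df=1.0):
    """Fit a binary CountVectorizer; return (vocabulary size, matrix density)."""
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df, binary=True)
    dtm = vectorizer.fit_transform(toy_docs)
    n_docs, n_tokens = dtm.shape
    # Share of document-term cells that hold a non-zero entry
    density = dtm.nnz / (n_docs * n_tokens)
    return n_tokens, density


# Keep every token (the settings used in the text)
all_tokens, all_density = vocab_stats()
# Drop tokens that appear in fewer than two documents
frequent_only, _ = vocab_stats(min_df=2)
# Drop tokens that appear in more than half of the documents
no_common, _ = vocab_stats(max_df=0.5)

print(all_tokens, frequent_only, no_common, round(all_density, 3))
```

Raising min_df removes rare tokens and lowering max_df removes very common ones, so both shrink the vocabulary. The density figure is the same ratio of stored entries to total cells that the sparse matrix summary reports for the full dataset.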
The output is a scipy.sparse matrix in compressed sparse row (CSR) format that efficiently stores the small share (<0.7%) of non-zero entries, 445870 in total, in the 2225 (document) rows and