December 2018
We split the data into training and test sets using the default 75:25 ratio, stratifying on the topic labels so that the class proportions in the test set closely mirror those in the training set:
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.factorize(docs.topic)[0]  # create integer class values
X = docs.body
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
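To see what stratify=y does, the following minimal sketch runs the same call on a toy dataset. The labels and placeholder features are hypothetical and are not the documents used in the text; the class sizes are chosen so that a 75:25 split divides them evenly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 60, 20, and 20 samples for classes 0, 1, and 2
y = np.repeat([0, 1, 2], [60, 20, 20])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one per sample

# Same call as in the text: default test_size of 0.25, stratified by label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)

# Share of each class in the two subsets
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)

print(len(y_train), len(y_test))
print(train_share, test_share)
```

Both subsets should show the same class shares as the full toy dataset. Without stratify, the shares would vary with the random seed, which matters most for small or rare classes.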
We proceed to learn the vocabulary from the training set and transform both datasets using CountVectorizer with default settings to obtain almost 26,000 features:
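from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)  # learn the vocabulary from the training set only
X_test_dtm = vectorizer.transform(X_test)  # reuse the training vocabulary
X_train_dtm.shape, X_test_dtm.shape
((1668, 25919), (557, 25919))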
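The distinction between fit_transform and transform is easier to see on a small scale. The sketch below uses four made-up sentences, not the documents from the text, to illustrate that the vocabulary comes from the training documents alone and that test tokens outside that vocabulary are dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents for illustration only
train_docs = ["stocks rise on strong earnings", "team wins the final match"]
test_docs = ["stocks fall after the match", "unseen words vanish"]

vectorizer = CountVectorizer()
train_dtm = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_dtm = vectorizer.transform(test_docs)        # applies it unchanged

# Terms learned from the training documents, in column order
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Number of known tokens counted in each test document
test_counts = np.asarray(test_dtm.sum(axis=1)).ravel()

print(vocab)
print(train_dtm.shape, test_dtm.shape)
print(test_counts)
```

Both matrices have the same number of columns because they share one vocabulary. The second test document contains no training terms, so its row is all zeros. This is why the test set in the text has exactly as many features as the training set.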
Training and prediction follow the standard sklearn fit/predict ...