August 2018 · Intermediate to advanced · 522 pages · 12h 45m · English
We are going to build a sample text classifier based on the NLTK Reuters corpus. It consists of thousands of newswire documents divided into 90 categories:
from nltk.corpus import reuters

print(reuters.categories())

[u'acq', u'alum', u'barley', u'bop', u'carcass', u'castor-oil', u'cocoa', u'coconut', u'coconut-oil', u'coffee', u'copper', u'copra-cake', u'corn', ...
To simplify the process, we'll take only two categories, which have a similar number of documents:
import numpy as np

Xr = np.array(reuters.sents(categories=['rubber']))
Xc = np.array(reuters.sents(categories=['cotton']))
Xw = np.concatenate((Xr, Xc))
As each document is already split into tokens and we want to apply our custom tokenizer ...
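A minimal sketch of that rejoining step, using small hypothetical stand-in documents instead of the downloaded corpus (the token lists and labels here are illustrative, not from the corpus): each tokenized sentence is joined back into a raw string, so a custom tokenizer or vectorizer can later process whole sentences, and a label vector is built alongside it.

```python
import numpy as np

# Hypothetical stand-ins for reuters.sents(...) output: each document
# is already a list of tokens, as in the corpus arrays built above.
Xr = [['rubber', 'prices', 'rose'], ['malaysia', 'exports', 'rubber']]
Xc = [['cotton', 'crop', 'fell']]

# Rejoin each token list into a single raw string.
Xw = np.array([' '.join(tokens) for tokens in Xr + Xc])

# Illustrative label vector: 0 for 'rubber' documents, 1 for 'cotton'.
Yw = np.array([0] * len(Xr) + [1] * len(Xc))

print(Xw[0])  # 'rubber prices rose'
print(Yw)     # [0 0 1]
```

With the documents as plain strings, a scikit-learn vectorizer configured with a custom tokenizer can consume `Xw` directly.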