The Tokenizer API in Keras provides several methods that help us prepare text for use in neural network models. We fit the tokenizer on a corpus with the fit_on_texts method and can inspect the resulting vocabulary through the word_index property.
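For example, a minimal sketch of fitting a tokenizer and inspecting its vocabulary (the sample documents are illustrative):

```python
from keras.preprocessing.text import Tokenizer

# Illustrative documents; any list of strings works here
docs = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)  # learn the vocabulary from the documents

# word_index maps each word to a unique integer, ordered by frequency
print(tokenizer.word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
```

Note that word_index contains every word seen during fitting; a num_words cap passed to the constructor does not shrink word_index but only limits which words are kept when documents are later encoded.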
Keras provides the Tokenizer API for preparing text: a tokenizer can be fit once and then reused to prepare multiple text documents. A tokenizer is constructed and then fit on raw text documents or on integer-encoded text documents. Here, individual words are called tokens, and the process of splitting text into tokens is called tokenization:
- Keras also gives us the text_to_word_sequence API, which splits a piece of text into a list of words, as sketched below:
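A minimal sketch (the sample sentence is illustrative):

```python
from keras.preprocessing.text import text_to_word_sequence

text = 'The quick brown fox jumped over the lazy dog.'

# By default this lowercases the text, filters out punctuation,
# and splits on whitespace
words = text_to_word_sequence(text)
print(words)
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```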
Here the vocabulary is capped at the 2,000 most frequent words before fitting; since the fit_on_texts call is truncated in the source, the X['text'].values argument (a pandas column of raw strings) is an assumed completion:

```python
from keras.preprocessing.text import Tokenizer

# use tokenizer and pad
maxFeatures = 2000
tokenizer = Tokenizer(num_words=maxFeatures, split=' ')
# assumed completion: the original fit_on_texts call is truncated
tokenizer.fit_on_texts(X['text'].values)
```
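The comment mentions padding as well; a minimal sketch of the step that typically follows, under the same assumption about X (the names sequences and padded are illustrative):

```python
from keras.preprocessing.sequence import pad_sequences

# Encode each document as a sequence of word indices (only the top
# maxFeatures words are kept), then pad so every sequence has the
# same length, as required for batching into a model
sequences = tokenizer.texts_to_sequences(X['text'].values)
padded = pad_sequences(sequences)
print(padded.shape)  # (num_documents, longest_sequence_length)
```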