We load the IMDB dataset from its original source so that we can preprocess it manually (see the notebook for details).
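The loading step itself lives in the notebook; as a rough illustration, the following sketch assumes the raw aclImdb archive from the Stanford source has been downloaded and extracted locally. The path and the load_reviews helper are assumptions for illustration, not the notebook's exact code:

from pathlib import Path
import pandas as pd

def load_reviews(split, path=Path('aclImdb')):
    """Read the raw text files for one split into a DataFrame of reviews and labels.

    Assumes the extracted archive layout aclImdb/{train,test}/{neg,pos}/*.txt.
    """
    rows = []
    for label, sentiment in enumerate(['neg', 'pos']):
        for file in (path / split / sentiment).glob('*.txt'):
            rows.append({'review': file.read_text(encoding='utf-8'), 'label': label})
    return pd.DataFrame(rows)

train_data = load_reviews('train')
test_data = load_reviews('test')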
Keras provides a tokenizer that we use to convert the text documents to integer-encoded sequences, as shown here:
from keras.preprocessing.text import Tokenizer

# Keep the 10,000 most frequent words; out-of-vocabulary words map to the oov_token
num_words = 10000
t = Tokenizer(num_words=num_words, lower=True, oov_token=2)
t.fit_on_texts(train_data.review)
vocab_size = len(t.word_index) + 1
train_data_encoded = t.texts_to_sequences(train_data.review)
test_data_encoded = t.texts_to_sequences(test_data.review)
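The value chosen for max_length in the next step depends on how long the encoded reviews are. A quick way to inspect the length distribution (an illustrative check, not part of the book's listing) is:

import numpy as np

# Summarize encoded review lengths to help pick a sensible maxlen for padding/truncation
lengths = np.array([len(seq) for seq in train_data_encoded])
print(f'median: {np.median(lengths):.0f}, 95th percentile: {np.percentile(lengths, 95):.0f}')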
We also use the pad_sequences function to convert the lists of lists (of unequal length) into stacked arrays of uniform length, padding shorter sequences and truncating longer ones, for both the train and test datasets:
max_length = 100
X_train_padded = pad_sequences(train_data_encoded, maxlen=max_length, ...
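The call above is truncated in this excerpt. A complete invocation might look like the following sketch, where the 'post' padding and truncation settings are assumptions rather than the book's exact arguments:

from keras.preprocessing.sequence import pad_sequences

max_length = 100
# Pad shorter reviews with zeros and truncate longer ones so every row has max_length tokens
X_train_padded = pad_sequences(train_data_encoded, maxlen=max_length,
                               padding='post', truncating='post')
X_test_padded = pad_sequences(test_data_encoded, maxlen=max_length,
                              padding='post', truncating='post')
print(X_train_padded.shape)  # (number of training reviews, 100)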