After going through the previous section, you should be able to clean the entire corpus and split it into sentences. The next steps, one-hot encoding and tokenizing the sentences, can be done as follows:
- Once the tokens and sequences have been saved to a file and loaded back into memory, they have to be encoded as integers, because the word embedding layer in the model expects input sequences made up of integers, not strings.
- This is done by mapping each word in the vocabulary to a unique integer and encoding the input sequences with that mapping. Later, when making predictions, each prediction can be converted back to an integer and looked up in the same mapping to find its associated word, reverse mapping from integers ...
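
The mapping described above can be sketched in plain Python. This is only an illustrative sketch, not necessarily the approach used elsewhere in this walkthrough: the sample `lines`, the names `word_to_index` and `index_to_word`, and the helper functions `encode` and `decode` are all hypothetical, and starting the indices at 1 (leaving 0 free for padding) is an assumption.

```python
# Hypothetical cleaned sequences; in practice these would be loaded from the saved file.
lines = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Map each distinct word to a unique integer, in order of first appearance.
# Assumption: indices start at 1 so that 0 stays free for padding.
word_to_index = {}
for line in lines:
    for word in line.split():
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index) + 1

# Reverse mapping, used to turn predicted integers back into words.
index_to_word = {index: word for word, index in word_to_index.items()}


def encode(line):
    """Convert a whitespace-separated line into a list of integers."""
    return [word_to_index[word] for word in line.split()]


def decode(indices):
    """Convert a list of integers back into a space-joined string of words."""
    return " ".join(index_to_word[index] for index in indices)


sequences = [encode(line) for line in lines]

# One extra slot for the reserved index 0.
vocab_size = len(word_to_index) + 1

print(sequences)
print(decode(sequences[1]))
```

Keeping both dictionaries around means the same vocabulary is used for encoding the training sequences and for decoding predictions later.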