Based on our earlier review of the text, the following operations could be performed to clean and preprocess the text in the input file. These are only a few of the available preprocessing options; you may want to explore more cleaning operations as an exercise:
- Replace dashes (–) with whitespace so that words on either side of a dash are split correctly
- Split words based on whitespaces
- Remove all punctuation from the input text in order to reduce the number of unique characters in the text that is fed into the model (for example, Why? becomes Why)
- Remove all words that are not alphabetic to remove standalone punctuation tokens and emoticons
- Convert all words from uppercase to lowercase in order to reduce the size ...
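The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `clean_text` is hypothetical, and treating double hyphens, en dashes, and em dashes as the "dashes" to replace is an assumption about the input file.

```python
import string


def clean_text(raw_text):
    """Return a list of cleaned, lowercase, alphabetic tokens (illustrative sketch)."""
    # Replace dashes with spaces so the words around them split apart.
    # Assumption: the input uses "--", an en dash, or an em dash.
    for dash in ("--", "\u2013", "\u2014"):
        raw_text = raw_text.replace(dash, " ")

    # Split words on whitespace.
    tokens = raw_text.split()

    # Remove all ASCII punctuation from each token ("Why?" becomes "Why").
    table = str.maketrans("", "", string.punctuation)
    tokens = [token.translate(table) for token in tokens]

    # Drop tokens that are not purely alphabetic, such as emptied-out
    # emoticons or standalone numbers.
    tokens = [token for token in tokens if token.isalpha()]

    # Convert everything to lowercase.
    return [token.lower() for token in tokens]


print(clean_text("Why? Because\u2013well :) it WORKS 42 times"))
# ['why', 'because', 'well', 'it', 'works', 'times']
```

Note that `string.punctuation` covers ASCII punctuation only, so curly quotes and similar characters would need to be handled separately if they appear in the input.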