Building text categorization models involves a series of preprocessing steps followed by a proper representation of the textual data as numerical vectors. The following are the general preprocessing steps:
- Sentence splitting: Split a document into a set of sentences.
- Tokenization: Split sentences into constituent words.
- Stemming or lemmatization: Word tokens are reduced to their base form. For example, the words playing, played, and plays share one base: play. The base word produced by stemming need not be a dictionary word, whereas the root word produced by lemmatization, also known as the lemma, is always a valid dictionary word.
- Text cleanup: Case conversion, spelling correction, stopword removal, and so on.
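The steps above can be sketched as a minimal pipeline in plain Python. This is an illustrative sketch only: the regex-based sentence splitter and tokenizer, the tiny suffix-stripping stemmer, and the small stopword set are all simplifications assumed here for clarity; production code would typically use a library such as NLTK or spaCy instead.

```python
import re

# Assumed toy stopword list; real pipelines use much larger sets.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "he", "she", "it"}

def split_sentences(doc):
    # Naive sentence splitting: break on ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]

def tokenize(sentence):
    # Lowercase and keep only alphabetic word tokens (case conversion + tokenization).
    return re.findall(r"[a-z']+", sentence.lower())

def stem(token):
    # Toy suffix-stripping stemmer: playing/played/plays -> play.
    # Note the output need not be a dictionary word.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(doc):
    # Full pipeline: sentences -> tokens -> stopword removal -> stemming.
    tokens = []
    for sentence in split_sentences(doc):
        for tok in tokenize(sentence):
            if tok not in STOPWORDS:
                tokens.append(stem(tok))
    return tokens
```

For example, `preprocess("He is playing. She played, and he plays.")` reduces all three inflected forms to the single base `play`, which is exactly what lets a downstream vectorizer count them as one feature.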