Categorizing newspaper articles and newswires into topics
Newspaper articles and newswires are a large, continuously published source of information about events over time. Classifying these texts is a preprocessing step for storing the documents in the appropriate corpus, and text categorization underlies much of text processing.
We will now introduce an N-gram-based text-classification algorithm. An N-gram is an N-character slice of a longer string. The key step of the algorithm is computing profiles of N-gram frequencies.
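To make the idea of an N-gram and its frequency profile concrete, here is a minimal Python sketch. The function names char_ngrams and ngram_profile, the top_k parameter, and the alphabetical tie-breaking are assumptions made for illustration; they are not taken from the algorithm described in this section, which may also preprocess the text (for example, padding or tokenizing) before slicing.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous n-character slice of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n, top_k=None):
    """Rank the n-grams of text from most to least frequent.

    Hypothetical helper: ties are broken alphabetically so that the
    ranking is deterministic.
    """
    counts = Counter(char_ngrams(text, n))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked


print(char_ngrams("TEXT", 2))      # ['TE', 'EX', 'XT']
print(ngram_profile("banana", 2))  # [('an', 2), ('na', 2), ('ba', 1)]
```

A profile built this way for each topic can be compared with the profile of a new document, which is where the frequency calculation does its work.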
Before introducing the algorithm, here are brief illustrations of a couple of concepts that it uses:
The N-gram-based text categorization
The summarized pseudocodes for the ...