There are several basic text processing techniques in use here. First, we build a corpus of the text. A corpus is a collection of text streams, typically paragraphs or pages of a book.
We then clean up the corpus in several steps:
- Convert all of the text to lowercase: this lets us index and match strings without any concerns about capitalization.
- Remove punctuation: punctuation carries no meaning for a word-level analysis.
- Remove numbers: again, we are looking for the themes of the page, and numbers rarely contribute to them.
- Remove stop words: drop miscellaneous filler words, such as the, and, and then. I'm not sure whether there is a stop-word set that would also exclude the HTML tags present on web pages.
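The cleanup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation; the small `STOP_WORDS` set is a hand-picked assumption (real projects use a much larger list, e.g. from NLTK or scikit-learn):

```python
import re
import string

# Tiny illustrative stop-word set -- an assumption for this sketch only.
STOP_WORDS = {"the", "and", "then", "a", "an", "of", "to", "in", "is"}

def clean(doc: str) -> list[str]:
    doc = doc.lower()                                               # 1. lowercase
    doc = doc.translate(str.maketrans("", "", string.punctuation))  # 2. drop punctuation
    doc = re.sub(r"\d+", "", doc)                                   # 3. drop numbers
    return [w for w in doc.split() if w not in STOP_WORDS]          # 4. drop stop words

print(clean("The Cat and the Hat, then 2 more cats!"))
# → ['cat', 'hat', 'more', 'cats']
```

Each step is independent, so the order is mostly a matter of taste; lowercasing first simply keeps the stop-word comparison case-insensitive.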
We can now produce a document-term matrix from the corpus. This produces a word index ...