Tokenizing and normalizing text
Extracting the contents of the page is just the first step. Before we get to the fun part of analyzing what the article contains (or, if you looked at blog posts, what they are about), we need to split the whole article into sentences and further into words.
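To make the splitting step concrete, here is a minimal sketch that uses only the Python standard library. The function names `split_sentences` and `split_words` and the sample text are our own illustrations, not code from the book, and the regular expressions are deliberately naive: they would break on abbreviations such as "Dr." and are no substitute for a proper tokenizer from an NLP library.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Abbreviations will be split incorrectly.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def split_words(sentence):
    # Keep runs of letters and digits, allowing one internal apostrophe
    # so that "don't" stays a single token; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", sentence)

# Hypothetical sample text standing in for an extracted article.
article = "Extracting the page is the first step. Next, we split it into words!"

sentences = split_sentences(article)
tokens = [split_words(s) for s in sentences]
```

With this sample, `sentences` holds two strings and `tokens` holds one list of words per sentence, which is the shape the later analysis steps would work on.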
Having done so, we would still face another issue: any text contains sentences in different tenses, the passive voice, or rarely seen grammatical constructs. For the purpose of extracting the topic or analyzing the sentiment, we do not really need to treat the words said and says separately; the word say would be enough. Thus, we will also be looking at normalizing the text, that is, bringing all the different versions of the same word ...
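The idea of mapping said and says to say can be sketched with a tiny lookup table. The `LEMMAS` dictionary and the `normalize` function below are hypothetical and cover only this one example; in practice a stemmer or lemmatizer from an NLP library would most likely do this job, so treat the snippet as an illustration of the concept rather than a working normalizer.

```python
# Hypothetical lookup table: maps inflected forms to a base form.
# A real normalizer would rely on a stemmer or lemmatizer instead.
LEMMAS = {
    "said": "say",
    "says": "say",
    "saying": "say",
}

def normalize(tokens):
    # Lowercase each token, then replace it with its base form
    # when the table knows one; unknown words pass through unchanged.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

normalized = normalize(["She", "said", "what", "he", "says"])
# normalized is ["she", "say", "what", "he", "say"]
```

After normalization, both said and says count as the same word, so a frequency count or a sentiment score sees one term instead of two.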