Text pre-processing

Before we build our model, we need to prepare the data in a form the model can consume: a feature vector and a class label for each sentence. In our case, the class label takes one of two values, positive or negative, depending on whether the sentence expresses a positive or a negative sentiment. Words are our features. We will use the bag-of-words model to represent our text as features. In a bag-of-words model, the following steps are performed to transform a text into a feature vector:

  1. Extract all unique individual words from the text dataset. We call a text dataset a corpus.
  2. Process the words. Processing typically involves removing numbers and other characters, placing the words in lowercase, stemming the words, and removing unnecessary white ...
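
As a rough illustration of the steps listed so far, the following minimal Python sketch lowercases each sentence, removes numbers and other non-letter characters, collapses extra white space, and collects the unique words of the corpus. It then counts how often each vocabulary word occurs in a sentence, which is the usual way a bag-of-words feature vector is formed and is an assumption here. The function names, the toy corpus, and its labels are hypothetical, and stemming is left out because it would normally rely on a dedicated stemming library.

```python
import re
from collections import Counter


def preprocess(sentence):
    """Lowercase a sentence, drop non-letter characters, and split into words."""
    text = sentence.lower()
    # Replace anything that is not an ASCII letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also discards repeated and leading/trailing whitespace.
    return text.split()


def build_vocabulary(corpus):
    """Return the sorted list of unique processed words in the corpus."""
    return sorted({word for sentence in corpus for word in preprocess(sentence)})


def to_feature_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(preprocess(sentence))
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus with one class label per sentence.
corpus = ["I loved this movie, 10 out of 10!", "I hated this movie."]
labels = ["positive", "negative"]

vocabulary = build_vocabulary(corpus)
features = [to_feature_vector(sentence, vocabulary) for sentence in corpus]

print(vocabulary)
for vector, label in zip(features, labels):
    print(vector, label)
```

Each row printed at the end pairs a feature vector with its class label, which is the shape of input described at the start of this section. Sorting the vocabulary is only a convenience that keeps the column order stable between runs.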
