Our data frames contain the raw data that we gathered from the Graph API. They include every kind of character that can appear in posts and comments, so we have to preprocess the text and perform an initial round of information extraction before we can understand what consumers are actually saying.
We define the feature extraction process as a pipeline that applies several kinds of transformation in sequence, each step working on the output of the previous one. The goal at this stage is to extract hashtags, keywords, and noun phrases from posts and comments.
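To make the idea of a pipeline concrete, here is a minimal sketch of transformations applied in sequence. It is an illustration under assumptions, not the implementation used in this chapter: the step names `extract_hashtags` and `clean_text`, the `run_pipeline` helper, and the record layout (a dict with a `message` key) are all hypothetical, and keyword and noun phrase extraction are left out because they need an NLP library.

```python
import re
import string

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(record):
    """Collect hashtags from the raw message (hypothetical step)."""
    record = dict(record)  # copy so the caller's record is not mutated
    record["hashtags"] = [tag.lower() for tag in HASHTAG_RE.findall(record["message"])]
    return record


def clean_text(record):
    """Lowercase the message, drop ASCII punctuation, collapse whitespace (hypothetical step)."""
    record = dict(record)
    text = record["message"].lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    record["clean"] = " ".join(text.split())
    return record


def run_pipeline(record, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        record = step(record)
    return record


# Hashtags are extracted first, because cleaning strips the "#" character.
PIPELINE = [extract_hashtags, clean_text]

result = run_pipeline(
    {"message": "  Loving the new #Coffee blend!!  #MorningRitual "},
    PIPELINE,
)
print(result["hashtags"])
print(result["clean"])
```

Keeping each step as a small function that takes and returns a record makes the order explicit and lets us add, remove, or reorder steps without touching the others.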
The preprocess() function cleans a raw verbatim (the message field in our dataset) by removing extra whitespace and punctuation and converting the text to lowercase. Then, ...