Clustering the Twitter dataset
Let's first get a feel for the data extracted from Twitter and understand its structure, so that we can prepare it and run it through the K-Means clustering algorithm. Our plan of attack follows the process and dataflow depicted earlier for unsupervised learning. The steps are as follows:
- Combine all tweet files into a single dataframe.
- Parse the tweets, remove stop words, extract emoticons, extract URLs, and finally normalize the words (for example, by mapping them to lowercase and removing punctuation and numbers).
- Feature extraction includes the following:
  - Tokenization: This breaks down the parsed tweet text into individual words or tokens.
  - TF-IDF: This applies the TF-IDF algorithm to create feature vectors from ...
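
To make the parsing, tokenization, and TF-IDF steps more concrete, here is a minimal sketch in plain Python on three made-up tweets. It is not the pipeline used on the actual dataset. The stop-word list, the regular expressions, the sample tweets, and the function names `parse_tweet` and `tf_idf` are all assumptions chosen for illustration, and the smoothed IDF formula is just one common variant.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on", "with"}

# Assumed patterns: http(s) URLs and a few common ASCII emoticons such as :) ;-) :D
URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def parse_tweet(text):
    """Return (tokens, urls, emoticons) for a single tweet."""
    # Extract URLs first so that "http://" is not mistaken for an emoticon.
    urls = URL_RE.findall(text)
    text = URL_RE.sub(" ", text)
    emoticons = EMOTICON_RE.findall(text)
    text = EMOTICON_RE.sub(" ", text)
    # Normalize: lowercase, then keep letters only (drops punctuation and numbers).
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize on whitespace and remove stop words.
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return tokens, urls, emoticons


def tf_idf(docs):
    """Weight each term by tf * idf, with idf = log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {term: count * math.log((n_docs + 1) / (df[term] + 1))
             for term, count in tf.items()}
        )
    return vectors


# Made-up sample tweets, for illustration only.
tweets = [
    "Learning Spark with Python :) http://example.com/spark",
    "Spark MLlib has K-Means clustering!!! 100% fun ;-)",
    "Python and Spark for the win",
]
parsed = [parse_tweet(t) for t in tweets]
token_lists = [tokens for tokens, _, _ in parsed]
vectors = tf_idf(token_lists)

for tokens, vector in zip(token_lists, vectors):
    print(tokens, vector)
```

In this sketch, a term that occurs in every tweet (here, `spark`) receives a weight of zero, which is the behavior we want from TF-IDF: words common to the whole corpus carry little information for telling clusters apart.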