Clustering the Twitter dataset

Let's first get a feel for the data extracted from Twitter and understand its structure, so that we can prepare it and run it through the K-Means clustering algorithm. Our plan of attack follows the process and dataflow depicted earlier for unsupervised learning. The steps are as follows:

  1. Combine all tweet files into a single dataframe.
  2. Parse the tweets, remove stop words, extract emoticons, extract URLs, and finally normalize the words (for example, by mapping them to lowercase and removing punctuation and numbers).
  3. Feature extraction includes the following:
    • Tokenization: This breaks down the parsed tweet text into individual words or tokens
    • TF-IDF: This applies the TF-IDF algorithm to create feature vectors from ...
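To make the parsing and feature extraction steps above concrete, here is a minimal sketch in plain Python rather than Spark. It is not the book's code: the names parse_tweet and tf_idf, the tiny stop-word list, the emoticon pattern, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "with"}

URL_RE = re.compile(r"https?://\S+")
# Assumed pattern covering only a few common emoticons such as :) ;-) :D :(
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def parse_tweet(text):
    """Return (tokens, emoticons, urls) for one raw tweet string."""
    # Pull URLs out first so their punctuation is not mistaken for emoticons.
    urls = URL_RE.findall(text)
    text = URL_RE.sub(" ", text)
    emoticons = EMOTICON_RE.findall(text)
    text = EMOTICON_RE.sub(" ", text)
    # Normalize: lowercase, then replace everything except letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize on whitespace and drop stop words.
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return tokens, emoticons, urls


def tf_idf(docs):
    """Map each tokenized document to a {term: weight} dictionary.

    Weight = raw term count * log((N + 1) / (df + 1)), a smoothed IDF
    picked for this sketch.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors
```

Calling parse_tweet on each tweet and passing the resulting token lists to tf_idf gives one sparse vector per tweet. A term that appears in every tweet gets a weight of zero under this formula, so it does not influence the distance between tweets when they are clustered.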
