June 2020
Intermediate to advanced
382 pages
11h 39m
English
Topic modeling is the process of discovering the concepts in a set of documents that can be used to differentiate them. In the context of tweets, it means finding the most appropriate topics into which a set of tweets can be divided. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling. Because each tweet is a short document (tweets were originally limited to 140 characters) that is usually about one very particular topic, we can write a simpler algorithm for topic modeling purposes. The algorithm is as follows:
Tokenize tweets.
Preprocess the data: remove stopwords, numbers, and symbols, and perform stemming.
Create a term-document matrix (TDM) for the tweets. Choose the top 200 words that appear most frequently in ...
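The steps listed so far can be sketched in a few lines of Python. This is a minimal illustration rather than the book's implementation: the stopword list, the `naive_stem` suffix stripper (a crude stand-in for a real stemmer such as Porter's), and the function names are all assumptions made for the example. It also assumes the top words are ranked by total frequency across all tweets.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "in", "of", "to", "for", "on", "it"}


def tokenize(tweet):
    """Lowercase the tweet and keep only alphabetic runs, dropping numbers and symbols."""
    return re.findall(r"[a-z]+", tweet.lower())


def naive_stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Tokenize, remove stopwords, and stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(tweet) if t not in STOPWORDS]


def build_tdm(tweets, top_n=200):
    """Return (vocab, matrix): one row per term, one column per tweet."""
    docs = [Counter(preprocess(t)) for t in tweets]
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    # Keep only the top_n most frequent terms across all tweets.
    vocab = [term for term, _ in totals.most_common(top_n)]
    matrix = [[doc[term] for doc in docs] for term in vocab]
    return vocab, matrix
```

Calling `build_tdm(tweets)` returns the selected vocabulary and a matrix in which entry `[i][j]` is the count of term `i` in tweet `j`. In practice, a library stemmer and a fuller stopword list would replace the toy versions used here.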