After 4 days of intense processing, we extracted around 10 million tweets, representing approximately 30 GB of JSON data.
Massaging Twitter data
One of the key reasons Twitter became so popular is that any message has to fit into a maximum of 140 characters. The drawback is also that every message has to fit into a maximum of 140 characters! Hence, the result is a massive increase in the use of abbreviations, acronyms, slang words, emoticons, and hashtags. In such messages, the main emotion may no longer come from the text itself, but rather from the emoticons used (http://dl.acm.org/citation.cfm?id=1628969), though some studies have shown that emoticons may sometimes lead to inadequate sentiment predictions (https://arxiv.org/pdf/1511.02556.pdf ...