Twitter and the Godwin point

With our text content properly cleaned up, we can feed it to a Word2Vec algorithm and attempt to understand the words in their actual context.

Learning context

As it says on the tin, the Word2Vec algorithm transforms a word into a vector. The idea is that all words are embedded into the same vector space, so that words used in similar contexts end up with vectors that lie close to one another.
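To make "close to one another" concrete, here is a minimal sketch in plain Scala that compares word vectors using cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors and the names `CosineSketch`, `vectors` and `mostSimilar` are made up for illustration only; a trained model would typically produce vectors with a hundred or more dimensions.

```scala
object CosineSketch {

  // Cosine similarity: the dot product divided by the product of the norms.
  // Values near 1.0 mean the vectors point in almost the same direction.
  // Note: an all-zero vector would yield NaN; this is not handled here.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Hypothetical embeddings, invented for this example (not model output)
  val vectors: Map[String, Array[Double]] = Map(
    "paris"  -> Array(0.9, 0.1, 0.0),
    "london" -> Array(0.8, 0.2, 0.1),
    "banana" -> Array(0.0, 0.1, 0.9)
  )

  // Return the other word whose vector is closest to the given word's vector
  def mostSimilar(word: String): String =
    (vectors - word).maxBy { case (_, v) => cosine(vectors(word), v) }._1
}
```

With these toy values, `CosineSketch.mostSimilar("paris")` picks "london" rather than "banana", because the first two vectors point in nearly the same direction.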

Word2Vec is well integrated into Spark, and a model can be trained as follows:

import org.apache.spark.mllib.feature.Word2Vec

val corpusRDD = tweetRDD
   .map(_.body.split("\\s").toSeq)
   .filter(_.distinct.length ...

