Twitter and the Godwin point

With our text content properly cleaned up, we can feed a Word2Vec algorithm and attempt to understand the words in their actual context.

Learning context

As it says on the tin, the Word2Vec algorithm transforms a word into a vector. The idea is that similar words will be embedded into similar vector spaces and, as such, will look close to one another contextually.

Well integrated into Spark, a Word2Vec model can be trained as follows:

import org.apache.spark.mllib.feature.Word2Vec val corpusRDD = tweetRDD    .map(_.body.split("\\s").toSeq)    .filter(_.distinct.length ...

Get Mastering Spark for Data Science now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.