Twitter and the Godwin point
With our text content properly cleaned up, we can feed it to a Word2Vec algorithm and attempt to understand words in their actual context.
Learning context
As it says on the tin, the Word2Vec algorithm transforms a word into a vector. The idea is that words used in similar contexts will be mapped to nearby points in the embedding space and, as such, will appear close to one another contextually, closeness typically being measured with cosine similarity.
Note
More information about the Word2Vec algorithm can be found at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
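To make the notion of closeness concrete, word vectors are usually compared with cosine similarity, which is also the measure Spark uses when looking up synonyms. The short sketch below uses made-up three-dimensional vectors purely for illustration; real Word2Vec embeddings typically have a hundred or more dimensions.

// Cosine similarity between two word vectors: values near 1.0 mean the
// words were used in similar contexts. The vectors are invented for
// illustration only.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

val vecBerlin  = Array(0.21, 0.83, -0.40)
val vecGermany = Array(0.19, 0.88, -0.35)
val vecBanana  = Array(-0.67, 0.05, 0.71)

println(cosine(vecBerlin, vecGermany))  // high: similar contexts
println(cosine(vecBerlin, vecBanana))   // low: unrelated contexts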
Well integrated into Spark, a Word2Vec model can be trained as follows:
import org.apache.spark.mllib.feature.Word2Vec

val corpusRDD = tweetRDD
  .map(_.body.split("\\s").toSeq)
  .filter(_.distinct.length ...
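The printed snippet is cut short at the filter step. For orientation, a minimal end-to-end sketch of what the training pipeline could look like with Spark's MLlib API follows; the minimum-token filter threshold, the parameter values, and the query word are illustrative assumptions, not the book's exact code.

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// tweetRDD: the RDD of cleaned tweets built earlier in the chapter,
// where each record exposes the tweet text through a `body` field
val corpusRDD = tweetRDD
  .map(_.body.split("\\s").toSeq)
  .filter(_.distinct.length > 3)   // assumed threshold: drop very short tweets

// Parameter values below are illustrative defaults, not the book's settings
val word2vec = new Word2Vec()
  .setVectorSize(100)   // dimensionality of the word embeddings
  .setMinCount(5)       // ignore words seen fewer than 5 times
  .setSeed(42L)

val model: Word2VecModel = word2vec.fit(corpusRDD)

// Words used in similar contexts should now map to nearby vectors;
// the query word here is an arbitrary placeholder
model.findSynonyms("politics", 10).foreach { case (word, similarity) =>
  println(f"$word%-20s $similarity%.3f")
}

The findSynonyms call returns the requested number of vocabulary words ranked by cosine similarity to the query word, which is a quick way to sanity-check that the learned embeddings capture context.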