Word2Vec with Spark ML on the 20 Newsgroups dataset

In this section, we look at how to use the Spark ML DataFrame API and the newer implementations in Spark 2.0.x to create a Word2Vec model.

We will first create a DataFrame from the dataset:

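import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val spConfig = (new SparkConf).setMaster("local").setAppName("SparkApp")
val spark = SparkSession
  .builder
  .appName("Word2Vec Sample")
  .config(spConfig)
  .getOrCreate()
import spark.implicits._

// wholeTextFiles returns an RDD of (path, content) pairs, one pair per file
val rawDF = spark.sparkContext
  .wholeTextFiles("./data/20news-bydate-train/alt.atheism/*")

// Keep only the file content; drop control characters (including newlines) and "(" characters
val temp = rawDF.map(x => x._2.filter(_ >= ' ').filter(!_.toString.startsWith("(")))

// Split each document on spaces and wrap the token array in a Tuple1 so toDF can name the column
val textDF = temp.map(x => x.split(" ")).map(Tuple1.apply).toDF("text")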

This will be followed by creating the Word2Vec class and training ...
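
As a rough illustration of that next step, the following is a minimal sketch of creating a Word2Vec estimator from org.apache.spark.ml.feature and fitting it to a DataFrame with a "text" column of token arrays. The toy sentences, the output column name "result", the vector size of 10, the minimum count of 0, and the object and method names are all assumptions made for this example and are not taken from the book.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object Word2VecSketch {

  // Hypothetical stand-in for textDF: one row per document, each row an array of tokens
  def toyTextDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      "the argument about belief and evidence".split(" "),
      "evidence and reason shape the argument".split(" "),
      "belief without evidence is still belief".split(" ")
    ).map(Tuple1.apply).toDF("text")
  }

  // Configure the estimator and fit it; parameter values are illustrative only
  def train(textDF: DataFrame): Word2VecModel = {
    val word2Vec = new Word2Vec()
      .setInputCol("text")     // column holding the token arrays
      .setOutputCol("result")  // assumed name for the document-vector column
      .setVectorSize(10)       // assumed embedding dimension
      .setMinCount(0)          // keep every token, since the toy corpus is tiny
      .setSeed(42L)
    word2Vec.fit(textDF)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("Word2Vec Sketch")
      .getOrCreate()

    val textDF = toyTextDF(spark)
    val model = train(textDF)

    // Words closest to "evidence" in the learned vector space
    model.findSynonyms("evidence", 2).show(false)

    // One averaged vector per document, written to the "result" column
    model.transform(textDF).show(false)

    spark.stop()
  }
}
```

With the real data, the call would be train(textDF) on the DataFrame built above, most likely with a larger vector size and a higher minimum count than the toy values used here.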
