In this section, we look at how to use the DataFrame-based Spark ML API and the newer implementations in Spark 2.0.x to create a Word2Vec model.
We first create a DataFrame from the dataset:
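import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val spConfig = (new SparkConf).setMaster("local").setAppName("SparkApp")
val spark = SparkSession
  .builder
  .appName("Word2Vec Sample")
  .config(spConfig)
  .getOrCreate()
import spark.implicits._

// wholeTextFiles returns an RDD of (file path, file content) pairs
val rawDF = spark.sparkContext
  .wholeTextFiles("./data/20news-bydate-train/alt.atheism/*")

// Keep only the content; drop control characters and '(' characters
val temp = rawDF.map(x => {
  x._2.filter(_ >= ' ').filter(!_.toString.startsWith("("))
})

// Split each document on spaces and wrap it in a single-column DataFrame
val textDF = temp.map(x => x.split(" ")).map(Tuple1.apply)
  .toDF("text")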
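To make the cleaning step concrete, here is a minimal plain-Scala sketch of what the two character filters and the split do to a single document, without needing a Spark session. The object name `TokenizeSketch`, the helper names `clean` and `tokenize`, and the sample text are hypothetical and used only for illustration; the filter and split expressions are the ones from the code above.

```scala
// Illustrative sketch only: mirrors the per-document transformation
// applied inside the RDD map calls above, using plain Scala strings.
object TokenizeSketch {
  // Drop control characters (anything below the space character,
  // which includes newlines and tabs), then drop '(' characters.
  def clean(doc: String): String =
    doc.filter(_ >= ' ').filter(!_.toString.startsWith("("))

  // Split the cleaned document on single spaces.
  def tokenize(doc: String): Array[String] = clean(doc).split(" ")

  def main(args: Array[String]): Unit = {
    // Hypothetical sample text, not taken from the 20 Newsgroups data
    val sample = "From: alice (Alice)\nSubject: hello world"
    println(tokenize(sample).mkString("[", ", ", "]"))
    // prints [From:, alice, Alice)Subject:, hello, world]
  }
}
```

Note that because newlines are removed rather than replaced with spaces, the last word of one line is fused with the first word of the next (`Alice)Subject:` here). If that matters for your vocabulary, one option would be to replace control characters with a space before splitting.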
This will be followed by creating the Word2Vec class and training ...