How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. The package statement for the recipe is as follows:
package spark.ml.cookbook.chapter12
  3. Import the necessary packages for Scala and Spark:
import edu.umd.cloud9.collection.wikipedia.WikipediaPage
import edu.umd.cloud9.collection.wikipedia.language.EnglishWikipediaPage
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession
import org.tartarus.snowball.ext.PorterStemmer
...
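
The remaining steps of the recipe are not shown in this excerpt. The following is only a minimal sketch, not the recipe's own code, of how three of the imported classes (HashingTF, IDF, and RowMatrix) are commonly combined: term frequencies are hashed into vectors, reweighted by inverse document frequency, and the resulting matrix is factored with a singular value decomposition. The object name TfIdfSvdSketch, the helper topSingularValues, the feature count of 1024, and the three toy documents are all assumptions made for illustration.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

// Hypothetical helper object; it is not part of the recipe's source.
object TfIdfSvdSketch {

  // Builds a TF-IDF matrix from already tokenized documents and returns
  // up to k of its largest singular values.
  def topSingularValues(spark: SparkSession,
                        docs: Seq[Seq[String]],
                        k: Int): Array[Double] = {
    val tokens = spark.sparkContext.parallelize(docs)

    // Hash each document's terms into a sparse vector (1024 features is an assumption).
    val tf = new HashingTF(1 << 10).transform(tokens).cache()

    // Fit IDF weights on the corpus, then rescale the term frequencies.
    val tfidf = new IDF().fit(tf).transform(tf).cache()

    // Treat each document vector as a matrix row and compute a truncated SVD.
    val svd = new RowMatrix(tfidf).computeSVD(k, computeU = false)
    svd.s.toArray
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("TfIdfSvdSketch")
      .getOrCreate()

    // Toy corpus standing in for real, stemmed Wikipedia text.
    val docs = Seq(
      Seq("spark", "matrix", "svd"),
      Seq("wikipedia", "page", "text"),
      Seq("spark", "wikipedia", "stemmer")
    )

    try println(topSingularValues(spark, docs, 2).mkString(", "))
    finally spark.stop()
  }
}
```

The sketch skips parsing and stemming entirely, so the Wikipedia and PorterStemmer imports play no part in it. It also assumes the Spark MLlib JAR files mentioned in step 1 are on the classpath.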
