
Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua


Exploring the 20 Newsgroups data

We will use a Spark program to load and analyze the dataset.

object TFIDFExtraction {

  def main(args: Array[String]) {

  }

}

Looking at the directory structure, you might recognize that once again, we have data contained in individual text files (one text file per message). Therefore, we will again use Spark's wholeTextFiles method to read the content of each file into a record in our RDD.

In the code that follows, PATH refers to the directory in which you extracted the 20news-bydate ZIP file:

val sc = new SparkContext("local[2]", "First Spark App")

val path = "../data/20news-bydate-train/*"
val rdd = sc.wholeTextFiles(path)
// count the number of records in the dataset
println(rdd.count)

If you set a breakpoint, ...
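Since wholeTextFiles yields (fileName, content) pairs and the 20 Newsgroups dataset stores one message per file inside a directory named after the newsgroup, the topic of each message can be recovered from the file name alone. A minimal sketch of that extraction, assuming a typical path layout (the NewsgroupFromPath object and the sample path are illustrative, not part of the book's code):

```scala
object NewsgroupFromPath {

  // The newsgroup topic is the parent directory of the message file,
  // i.e. the second-to-last component of the path.
  def newsgroup(fileName: String): String =
    fileName.split("/").takeRight(2).head

  def main(args: Array[String]): Unit = {
    // Illustrative path in the shape produced by wholeTextFiles
    val sample = "file:/data/20news-bydate-train/rec.sport.hockey/52550"
    println(NewsgroupFromPath.newsgroup(sample)) // prints rec.sport.hockey
  }
}
```

With a helper like this, mapping over the keys of the RDD (for example, rdd.map { case (file, text) => newsgroup(file) }) would give one topic label per message.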
