Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua


Text classification with Spark 2.0

In this section, we use the libsvm-formatted version of the 20newsgroup data with Spark's DataFrame-based APIs to classify text documents. The current version of Spark supports libsvm version 3.22 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
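To make the data format concrete, here is a minimal sketch of how a single libsvm-formatted line can be read. Each line holds a numeric label followed by sparse `index:value` pairs, with indices starting at 1. The names `LibsvmRecord` and `LibsvmLine` are hypothetical helpers introduced for illustration only; Spark ships its own libsvm reader, so this parser is not needed in practice.

```scala
// One parsed libsvm record: a numeric label plus sparse (index, value) features.
// Hypothetical helper type, not part of Spark or of the book's code.
final case class LibsvmRecord(label: Double, features: Seq[(Int, Double)])

object LibsvmLine {
  // Parses a non-empty line such as "3 1:0.5 7:1.25".
  // Indices are kept exactly as written in the file (1-based).
  def parse(line: String): LibsvmRecord = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toList.map { token =>
      val Array(index, value) = token.split(":", 2)
      (index.toInt, value.toDouble)
    }
    LibsvmRecord(label, features)
  }
}
```

For example, `LibsvmLine.parse("3 1:0.5 7:1.25")` yields label `3.0` with two non-zero features at indices 1 and 7. For the 20newsgroup data, the label identifies the newsgroup and the features describe the document's terms.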

Download the libsvm-formatted data from the following link and copy the output folder under Spark-2.0.x.

Visit the following link for the 20newsgroup libsvm data: https://1drv.ms/f/s!Av6fk5nQi2j-iF84quUlDnJc6G6D

Import the appropriate packages from org.apache.spark.ml and create the Scala wrapper:

package org.apache.spark.examples.ml

import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
...
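The listing above is cut off in this excerpt. As a rough, independent sketch (not the book's code), a DataFrame-based Naive Bayes classifier over libsvm data might look like the following. The object name, the fallback input path, the 80/20 split, and the seed are all assumptions made for illustration, and the sketch builds a `SparkSession` directly rather than going through `SparkConf`.

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

// Hypothetical object name; the book's wrapper is not shown in this excerpt.
object NewsgroupNaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    // Assumed location of the downloaded libsvm data; pass the real path as the first argument.
    val inputPath = if (args.nonEmpty) args(0) else "output/20news_libsvm"

    val spark = SparkSession.builder()
      .appName("NewsgroupNaiveBayesSketch")
      .master("local[*]")
      .getOrCreate()

    // The libsvm data source yields a DataFrame with "label" and "features" columns.
    val data = spark.read.format("libsvm").load(inputPath)

    // Assumed 80/20 train/test split with a fixed seed for repeatability.
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 1234L)

    // Naive Bayes requires non-negative feature values.
    val model = new NaiveBayes().fit(training)
    val predictions = model.transform(test)

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    println(s"Test accuracy = ${evaluator.evaluate(predictions)}")

    spark.stop()
  }
}
```

Running it requires Spark 2.0 or later on the classpath, for example via `spark-submit`, with the path to the downloaded data supplied as the first argument.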
