Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua

Extracting features from the Kaggle/StumbleUpon evergreen classification dataset

Before we begin, we will remove the column name header from the first line of the file, which makes the data easier to work with in Spark. Change to the directory in which you downloaded the data (referred to as PATH here), then run the following command to remove the first line and redirect the result to a new file called train_noheader.tsv:

  > sed 1d train.tsv > train_noheader.tsv 
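As a quick sanity check, it can help to confirm that exactly one line was dropped and that the new file now starts with a data row. The following is a minimal sketch, not part of the original workflow: it builds a tiny stand-in file in a temporary directory (the two sample rows and column names are made up for illustration) so that it can be run anywhere. On the real dataset, you would point the same `wc` and `head` commands at train.tsv and train_noheader.tsv.

```shell
# Build a tiny stand-in for train.tsv (hypothetical contents, for illustration only).
tmpdir=$(mktemp -d)
printf 'url\tlabel\nhttp://example.com/a\t1\nhttp://example.com/b\t0\n' > "$tmpdir/train.tsv"

# Same command as in the text: delete line 1 and write the rest to a new file.
sed 1d "$tmpdir/train.tsv" > "$tmpdir/train_noheader.tsv"

# The new file should have exactly one line fewer than the original.
# (tr strips the padding that some wc implementations add.)
orig_lines=$(wc -l < "$tmpdir/train.tsv" | tr -d ' ')
new_lines=$(wc -l < "$tmpdir/train_noheader.tsv" | tr -d ' ')
echo "original: $orig_lines, without header: $new_lines"

# The first line of the new file should now be a data row, not the column names.
first_line=$(head -n 1 "$tmpdir/train_noheader.tsv")
echo "first line: $first_line"

# Remove the temporary files.
rm -rf "$tmpdir"
```

If the two counts differ by anything other than one, or the first line still shows column names, the header was not removed as expected.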

Now, we are ready to start up our Spark shell (remember to run this command from your Spark installation directory):

  > ./bin/spark-shell --driver-memory 4g

For the remainder of this chapter, you can type the code that follows directly into your Spark shell.

In a manner similar to ...
