Before we begin, we will remove the column name header from the first line of the file to make the data easier to work with in Spark. Change to the directory in which you downloaded the data (referred to as PATH here) and run the following command, which removes the first line and redirects the result to a new file called train_noheader.tsv:
> sed 1d train.tsv > train_noheader.tsv
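As an optional sanity check, the effect of `sed 1d` can be tried on a small stand-in file before running it on the real data. The following is a minimal sketch; the file names `sample.tsv` and `sample_noheader.tsv` and the column names `col_a` and `col_b` are hypothetical and are not part of the dataset:

```shell
# Work in a throwaway directory so no real data is touched.
tmp=$(mktemp -d)

# Hypothetical stand-in for train.tsv: one header line followed by two data rows.
printf 'col_a\tcol_b\nrow1_a\trow1_b\nrow2_a\trow2_b\n' > "$tmp/sample.tsv"

# Same sed command as above, applied to the sample file.
sed 1d "$tmp/sample.tsv" > "$tmp/sample_noheader.tsv"

# Count lines before and after; the new file should have exactly one line fewer.
orig_lines=$(wc -l < "$tmp/sample.tsv")
new_lines=$(wc -l < "$tmp/sample_noheader.tsv")
echo "original: $((orig_lines)), without header: $((new_lines))"

# The first line of the new file should be the first data row, not the header.
first_line=$(head -n 1 "$tmp/sample_noheader.tsv")
echo "first line is now: $first_line"
```

The same two checks (`wc -l` and `head -n 1`) can be run against train.tsv and train_noheader.tsv to confirm that only the header line was dropped.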
Now, we are ready to start up our Spark shell (remember to run this command from your Spark installation directory):
> ./bin/spark-shell --driver-memory 4g
For the remainder of this chapter, you can type the code that follows directly into your Spark shell.
In a manner similar to ...