
Apache Spark for Data Science Cookbook by Padma Priya Chitturi


Variable identification

In this recipe, we will see how to identify the predictor (input) and target (output) variables for data at scale in Spark. The next step is then to identify the category of each variable.
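To make the two steps concrete, here is a minimal sketch in plain Scala (no Spark required). It tags each column as a predictor or the target, then guesses whether it is categorical or continuous from a few sampled values. The column names, sample rows, and the `identify` and `inferCategory` helpers are assumptions made up for illustration; they are not taken from the recipe's data.

```scala
// Illustrative only: names and sample values below are hypothetical.
object VariableIdentification {
  sealed trait Role
  case object Predictor extends Role
  case object Target extends Role

  sealed trait Category
  case object Categorical extends Category
  case object Continuous extends Category

  case class Variable(name: String, role: Role, category: Category)

  // A column is treated as continuous only if every sampled value parses as a number.
  def inferCategory(samples: Seq[String]): Category =
    if (samples.nonEmpty && samples.forall(s => scala.util.Try(s.trim.toDouble).isSuccess))
      Continuous
    else
      Categorical

  // Step 1: mark the target column; every other column is a predictor.
  // Step 2: infer the category of each column from the sampled rows.
  def identify(header: Seq[String], rows: Seq[Seq[String]], targetName: String): Seq[Variable] =
    header.zipWithIndex.map { case (name, i) =>
      val role = if (name == targetName) Target else Predictor
      Variable(name, role, inferCategory(rows.map(_(i))))
    }

  def main(args: Array[String]): Unit = {
    val header = Seq("student_id", "gender", "height", "play_cricket")
    val rows = Seq(
      Seq("S001", "M", "5.2", "Y"),
      Seq("S002", "F", "4.9", "N"))
    identify(header, rows, "play_cricket").foreach(println)
  }
}
```

Note that this heuristic would label a numerically coded category (for example, 0/1 flags) as continuous, so in practice the inferred categories should be reviewed against what the columns mean.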

Getting ready

To step through this recipe, you will need a machine running Ubuntu 14.04 (a Linux flavor). You also need Apache Hadoop 2.6 and Apache Spark 1.6.0 installed.

How to do it…

  1. Let's take an example of student data, from which we want to predict whether a student will play cricket. Here is what the sample data looks like:
    [Figure: sample student data]
  2. The preceding data resides in HDFS. Load the data into Spark as follows:
     import org.apache.spark._ ...
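Since the listing above is cut off, the following is a separate, hedged sketch of what loading such a file might look like with the Spark 1.6 API. It is not the book's code. The HDFS path, the `Student` case class, its column names, and the assumption of a comma-separated file without a header row are all hypothetical. The master URL is assumed to be supplied by `spark-submit`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

object LoadStudentData {
  // Hypothetical record layout; the real columns come from the sample data.
  case class Student(studentId: String, gender: String, height: Double, playCricket: String)

  // Numeric Spark SQL types are treated as continuous, everything else as categorical.
  def categoryOf(dataType: DataType): String = dataType match {
    case ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => "continuous"
    case _: DecimalType => "continuous"
    case _ => "categorical"
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VariableIdentification"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder path; assumes comma-separated fields and no header row.
    val students = sc.textFile("hdfs://namenode:9000/data/students.csv")
      .map(_.split(","))
      .map(f => Student(f(0), f(1), f(2).toDouble, f(3)))
      .toDF()

    // Assumed target column; all remaining columns are predictors.
    val target = "playCricket"
    students.schema.fields.foreach { f =>
      val role = if (f.name == target) "target" else "predictor"
      println(s"${f.name}: $role, ${categoryOf(f.dataType)}")
    }

    sc.stop()
  }
}
```

Reading the schema from the DataFrame keeps the role and category report in sync with whatever columns are actually loaded, rather than hard-coding a second list of names.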
