In this recipe, we will see how to draw a sample of the data from the entire population.
To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Also, have Apache Hadoop 2.6 and Apache Spark 1.6.0 installed. Readers are expected to have knowledge of sampling techniques.
Let's take an example of loan prediction data. Here is what the sample data looks like:
Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv
import org.apache.spark._
import org.apache.spark.sql.SQLContext
...
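The listing above is truncated, so the following is only a minimal sketch of one way to draw a sample with the plain RDD API in Spark 1.6, not necessarily the approach this recipe takes. It assumes the CSV has been saved locally as Loan_Prediction_Data.csv (a hypothetical path) and that its first line is a header row. The object name SamplingSketch, the helper sampleRows, the 10% fraction, and the seed 42 are illustrative choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SamplingSketch {

  // Hypothetical helper: keep each row independently with probability
  // `fraction` (sampling without replacement). The sample size is therefore
  // approximate; a fixed seed makes the draw reproducible.
  def sampleRows(rows: RDD[String], fraction: Double, seed: Long): RDD[String] =
    rows.sample(withReplacement = false, fraction = fraction, seed = seed)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SamplingSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: the file was downloaded to this local path.
      val path = if (args.nonEmpty) args(0) else "Loan_Prediction_Data.csv"
      val lines = sc.textFile(path)

      // Assumption: the first line is a header, so it is excluded
      // from the population before sampling.
      val header = lines.first()
      val population = lines.filter(_ != header)

      // Roughly 10% of the rows, chosen without replacement.
      val sampled = sampleRows(population, 0.1, 42L)
      println(s"Population: ${population.count()}, sample: ${sampled.count()}")

      // takeSample returns a fixed number of rows (at most 5 here)
      // to the driver as a local array.
      val fixedSize = population.takeSample(withReplacement = false, num = 5, seed = 42L)
      fixedSize.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

sample is a lazy transformation whose result size varies around fraction times the population size, whereas takeSample is an action that collects the requested number of rows to the driver, so it suits small samples only.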