Dataset preparation for training

Since we do not have any unlabeled data, we will randomly select some samples for testing. One more thing to note is that the features and the labels come in two separate files. Therefore, we perform the necessary preprocessing first and then join the two, so that our preprocessed data has the features and the labels together.
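The preprocessing code itself is not shown at this point, but a minimal sketch of the join step could look like the following. The file names (data/features.csv and data/labels.csv), the shared sample identifier column id, and the SparkSession setup are assumptions for illustration, not the actual values used in the project:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("DatasetPreparation")
        .master("local[*]")
        .getOrCreate();

// Read the feature file and the label file separately (hypothetical paths).
Dataset<Row> features = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("data/features.csv");

Dataset<Row> labels = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("data/labels.csv");

// Join on the shared sample identifier so that every row carries
// both its feature columns and its label.
Dataset<Row> joined = features.join(labels, "id");

An inner join on the identifier keeps only the samples that appear in both files, which is what we want before splitting the data.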

The remaining samples will then be used for training. Finally, we will save the training and test sets in separate CSV files to be used later on. First, let's load the samples and look at the statistics. Note that we use Spark's read() method, but we also specify the necessary options and the format:

Dataset<Row> data = spark.read()
        .option("maxColumns", 25000)
        .format("com.databricks.spark.csv")
        .option("header", ...
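Once the features and labels have been joined into a single Dataset (called data here), the split-and-save steps described above can be sketched as follows. The 80/20 ratio, the seed, and the output paths are placeholders rather than values taken from the project:

// Randomly split into training and test sets, then persist both as CSV.
Dataset<Row>[] splits = data.randomSplit(new double[]{0.8, 0.2}, 12345L);
Dataset<Row> training = splits[0];
Dataset<Row> test = splits[1];

// Write each split with a header row so it can be reloaded later.
training.write().option("header", "true").csv("output/training");
test.write().option("header", "true").csv("output/test");

Note that the CSV writer produces a directory of part files for each split; calling coalesce(1) on a Dataset before writing yields a single CSV file if that is preferred.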
