Scala Machine Learning Projects by Md. Rezaul Karim

Data pre-processing and feature engineering

I already stated that all 24 VCF files together amount to 820 GB of data. Therefore, I decided to use only the genetic variants of chromosome Y to make the demonstration clearer. That file is around 160 MB, which should not pose a significant computational challenge. You can download all the VCF files as well as the panel file from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/.

Let us get started. We start by creating a SparkSession, the gateway to the Spark application:

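import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .appName("PopStrat")
  .master("local[*]") // run locally, using all available cores
  .config("spark.sql.warehouse.dir", "C:/Exp/") // adjust this path for your machine
  .getOrCreate()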
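Before pointing Spark at the data, it can help to confirm that the session came up the way we expect. The following is only a minimal sketch, assuming the spark-sql library is on the classpath; the object name SessionCheck and the helper buildSession are hypothetical names introduced here for illustration. It builds a local session like the one above, prints a few of its settings, and then shuts it down:

```scala
import org.apache.spark.sql.SparkSession

object SessionCheck {
  // Hypothetical helper (not part of the original listing): builds a local
  // session with the given application name, using all available cores.
  def buildSession(appName: String): SparkSession =
    SparkSession
      .builder()
      .appName(appName)
      .master("local[*]")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession("PopStrat")
    try {
      // Print a few settings to confirm the session is configured as intended.
      println(s"Spark version: ${spark.version}")
      println(s"Master:        ${spark.sparkContext.master}")
      println(s"Warehouse dir: ${spark.conf.get("spark.sql.warehouse.dir", "<unset>")}")
    } finally {
      // Release local resources once the check is done.
      spark.stop()
    }
  }
}
```

Note that getOrCreate() returns an already running session if one exists, so in a long-lived shell or notebook the printed settings may reflect that earlier session rather than the values passed to the builder.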

Then let's give Spark the paths of both the VCF file and the panel file:

val genotypeFile ...
