Summary
In this chapter, we saw how to interoperate with big data tools such as Spark, H2O, and ADAM to handle a large-scale genomics dataset. We applied the Spark-based K-means algorithm to genetic variant data from the 1000 Genomes Project, aiming to cluster genotypic variants at population scale.
Then we applied an H2O-based deep learning (DL) algorithm and Spark-based Random Forest models to predict geographic ethnicity. Additionally, we learned how to install and configure H2O for DL, knowledge that will be used in later chapters. Finally, and importantly, we learned how to use H2O to compute variable importance in order to select the most important features in a training set.
In the next chapter, we will see how effectively we ...