Repeatability and automation
In this section, we will discuss some methods of organizing dataset preprocessing into workflows, and then use Apache Spark pipelines to represent and implement these workflows. Then, we will review solutions for automating data preprocessing.
After this section, we will be able to use Spark pipelines to represent and implement dataset preprocessing workflows, and we will understand some of the automation solutions made available by Apache Spark.
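To make the idea of a workflow concrete before turning to Spark, the following is a minimal sketch in plain Python of preprocessing represented as an ordered list of stages. It is not Spark's API: the `Stage` class, the `run_workflow` function, the two stage functions, and the sample records are all hypothetical and exist only for illustration. In Spark, `pyspark.ml.Pipeline` plays a comparable role by chaining a list of stages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A dataset is modeled here as a list of dictionaries (an assumption made
# for this sketch; Spark would use a DataFrame instead).
Record = Dict[str, object]
Dataset = List[Record]


@dataclass
class Stage:
    """One named preprocessing step (hypothetical helper, not a Spark class)."""
    name: str
    transform: Callable[[Dataset], Dataset]


def run_workflow(stages: List[Stage], data: Dataset) -> Dataset:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        data = stage.transform(data)
    return data


def drop_missing(data: Dataset) -> Dataset:
    """Data cleaning: keep only records that have no missing values."""
    return [r for r in data if all(v is not None for v in r.values())]


def add_name_length(data: Dataset) -> Dataset:
    """Feature extraction: derive a new numeric column from an existing one."""
    return [{**r, "name_length": len(str(r["name"]))} for r in data]


# The workflow is just data: an ordered list that can be stored and rerun.
workflow = [
    Stage("data_cleaning", drop_missing),
    Stage("feature_extraction", add_name_length),
]

raw = [{"id": 1, "name": "Ann"}, {"id": 2, "name": None}]
prepared = run_workflow(workflow, raw)
print(prepared)
```

Because the workflow is an explicit, ordered list, the same steps can be rerun on new data without change, which is the repeatability that this section is concerned with.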
Dataset preprocessing workflows
Our data preparation work, from Data cleaning to Identity matching to Data re-organization to Feature extraction, was organized to reflect our orderly, step-by-step process of preparing datasets for machine learning. In other words, all the data ...