Repeatability and automation

In this section, we will discuss methods of organizing dataset preprocessing into workflows, and then use Apache Spark pipelines to represent and implement these workflows. We will then review solutions for automating data preprocessing.

After this section, we will be able to use Spark pipelines to represent and implement dataset preprocessing workflows, and we will understand some of the automation solutions that Apache Spark makes available.

Dataset preprocessing workflows

Our data preparation work, from Data cleaning to Identity matching to Data re-organization to Feature extraction, was organized to reflect our orderly, step-by-step process of preparing datasets for machine learning. In other words, all the data ...
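To make the idea of a preprocessing workflow concrete, here is a minimal sketch in plain Python that treats the four steps named above as an ordered list of stages applied one after another. It is a stand-in for illustration only and does not use Spark's API; the record fields (`user_id`, `name`), the toy logic inside each stage, and the names `run_workflow` and `WORKFLOW` are all assumptions made for this example. Spark's own `Pipeline` follows a similar pattern, taking an ordered list of stages.

```python
from typing import Callable, Dict, List

# Hypothetical record type for this sketch: one dict per row.
Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def clean(records: List[Record]) -> List[Record]:
    """Data cleaning: drop rows without a user_id and trim the name."""
    return [
        {**r, "name": str(r["name"]).strip()}
        for r in records
        if r.get("user_id") is not None
    ]


def match_identities(records: List[Record]) -> List[Record]:
    """Identity matching (toy version): keep the first row per user_id."""
    first_seen: Dict[object, Record] = {}
    for r in records:
        first_seen.setdefault(r["user_id"], r)
    return list(first_seen.values())


def reorganize(records: List[Record]) -> List[Record]:
    """Data re-organization: order rows by user_id."""
    return sorted(records, key=lambda r: r["user_id"])


def extract_features(records: List[Record]) -> List[Record]:
    """Feature extraction (toy version): add the length of the name."""
    return [{**r, "name_length": len(str(r["name"]))} for r in records]


def run_workflow(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        records = stage(records)
    return records


# The workflow is just the ordered list of stages, so it can be rerun as is.
WORKFLOW: List[Stage] = [clean, match_identities, reorganize, extract_features]

raw = [
    {"user_id": 2, "name": " Bob "},
    {"user_id": None, "name": "ghost"},
    {"user_id": 1, "name": "Alice"},
    {"user_id": 2, "name": "Bobby"},
]
prepared = run_workflow(raw, WORKFLOW)
```

Because the workflow is data (a list of stages) rather than a one-off script, the same sequence can be rerun unchanged on a new dataset, which is the repeatability this section is concerned with.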
