In the previous section, we developed individual steps using Spark primitives, that is, UDFs, native Spark algorithms, and H2O algorithms. However, invoking all these transformations on unseen data requires a lot of manual effort. Hence, Spark introduces the concept of pipelines, mainly motivated by Python's scikit-learn pipelines (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
The pipeline is composed of stages that are represented ...