We run a sequence of algorithms to process and learn from a given dataset. For example, in text classification, we split each document into words and convert the words into a numerical feature vector. Finally, we learn a predictive model using this feature vector and labels.
Spark ML represents such a workflow as a pipeline, which consists of a sequence of PipelineStages (transformers and estimators) to be run in a particular order.
Each stage in PipelineStages is one of the components, either a transformer or an estimator. The stages are run in a particular order while the input DataFrame flows through the stages.
In the ...