As we have seen the movie lens use case, it is quite common to run a sequence of machine learning algorithms to process and learn from data. Another example is a simple text document processing workflow, which can include several stages:
- Split the document's text into words
- Convert the document's words into a numerical feature vector
- Learn a prediction model from feature vectors and labels
Spark MLlib represents such a workflow as a Pipeline; it consists of Pipeline Stages in sequence (Transformers and Estimators), which are run in a specific order.
A Pipeline is specified as a sequence of stages. Each stage is a Transformer or an Estimator. Transform converts one data frame into another. Estimator, on the ...