How does it work?

A pipeline is a sequence of stages, and each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage, as sketched below:

  • Transformer stages: the transform() method is called on the DataFrame
  • Estimator stages: the fit() method is called on the DataFrame, producing a Transformer (the fitted model) whose transform() method is then applied
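To make the distinction concrete, here is a minimal Scala sketch of the two stage types; the tiny DataFrames and column names are illustrative assumptions, not examples from the book:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StageTypes").getOrCreate()

// Transformer: transform() maps one DataFrame to another by appending columns
val docs = spark.createDataFrame(Seq((0L, "spark ml pipelines"))).toDF("id", "text")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(docs)          // adds a "words" column

// Estimator: fit() learns from a DataFrame and returns a Transformer (a Model)
val labelled = spark.createDataFrame(Seq(          // toy labelled data
  (Vectors.dense(0.0, 1.0), 1.0),
  (Vectors.dense(1.0, 0.0), 0.0)
)).toDF("features", "label")
val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(labelled)                       // a LogisticRegressionModel
val scored = model.transform(labelled)             // the fitted model is a Transformer too
```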

A pipeline is created by declaring its stages, configuring the appropriate parameters, and then chaining them together in a Pipeline object. For example, to create a simple classification pipeline we would tokenize the input text into words, use the HashingTF feature extractor to convert the words into feature vectors, and then fit a logistic regression model.
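A minimal Scala sketch of that three-stage pipeline, following the standard Spark ML pattern; the toy training DataFrame, column names, and parameter values are illustrative assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SimplePipeline").getOrCreate()

// Toy labelled documents: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "spark is great", 1.0),
  (1L, "hadoop mapreduce", 0.0),
  (2L, "spark streaming jobs", 1.0),
  (3L, "legacy batch system", 0.0)
)).toDF("id", "text", "label")

// Stage 1: split the text column into words
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Stage 2: hash the words into a fixed-length feature vector
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Stage 3: the Estimator fitted on the extracted features
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)

// Chain the stages and fit the whole pipeline in one call
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)   // a PipelineModel, itself a Transformer
```

Because the fitted PipelineModel is itself a Transformer, calling model.transform() on new data applies every stage in order in a single call.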

Tip

Please ensure that you add the Apache Spark ML JAR either in the ...
