How does it work?

A pipeline is a sequence of stages, and each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage, as sketched below:

  • Transformer stages: the transform() method is called on the DataFrame
  • Estimator stages: the fit() method is called on the DataFrame, producing a Transformer (the fitted model) whose transform() method is then applied
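To make the distinction concrete, here is a minimal Scala sketch of the two stage types; the tiny DataFrames and column names are illustrative assumptions, not examples from the book:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StageTypes").getOrCreate()

// Transformer: transform() maps one DataFrame to another by appending columns
val docs = spark.createDataFrame(Seq((0L, "spark ml pipelines"))).toDF("id", "text")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(docs)          // adds a "words" column

// Estimator: fit() learns from a DataFrame and returns a Transformer (a Model)
val labelled = spark.createDataFrame(Seq(          // toy labelled data
  (Vectors.dense(0.0, 1.0), 1.0),
  (Vectors.dense(1.0, 0.0), 0.0)
)).toDF("features", "label")
val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(labelled)                       // a LogisticRegressionModel
val scored = model.transform(labelled)             // the fitted model is a Transformer too
```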

A pipeline is created by declaring its stages, configuring the appropriate parameters, and then chaining them together in a Pipeline object. For example, to create a simple classification pipeline we would tokenize the input text into words, use the HashingTF feature extractor to convert the words into feature vectors, and then fit a logistic regression model.
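A minimal Scala sketch of that three-stage pipeline, following the standard Spark ML pattern; the toy training DataFrame, column names, and parameter values are illustrative assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SimplePipeline").getOrCreate()

// Toy labelled documents: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "spark is great", 1.0),
  (1L, "hadoop mapreduce", 0.0),
  (2L, "spark streaming jobs", 1.0),
  (3L, "legacy batch system", 0.0)
)).toDF("id", "text", "label")

// Stage 1: split the text column into words
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Stage 2: hash the words into a fixed-length feature vector
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Stage 3: the Estimator fitted on the extracted features
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)

// Chain the stages and fit the whole pipeline in one call
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)   // a PipelineModel, itself a Transformer
```

Because the fitted PipelineModel is itself a Transformer, calling model.transform() on new data applies every stage in order in a single call.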

Tip

Please ensure that you add the Apache Spark ML JAR either in the ...
