A Spark ML pipeline is defined by a sequence of stages, each of which is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each one.
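For illustration, here is a minimal sketch of such a pipeline in PySpark. The specific stages (Tokenizer, HashingTF, LogisticRegression), the toy training data, and the column names are assumptions chosen for the example, not something prescribed by the text:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Hypothetical toy training data: (id, text, label).
training = spark.createDataFrame(
    [(0, "spark is great", 1.0), (1, "hadoop map reduce", 0.0)],
    ["id", "text", "label"],
)

# Two Transformers (Tokenizer, HashingTF) followed by an
# Estimator (LogisticRegression).
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Stages run in the order given; fit() threads the DataFrame
# through each stage and returns a fitted PipelineModel.
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)
```

Note the division of labor: each Transformer maps a DataFrame to a new DataFrame (adding a column), while the Estimator's `fit()` learns from the data and produces a Transformer of its own.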
A DataFrame is the basic data structure that flows through the pipeline. It is a distributed dataset of rows whose columns support many types, such as numeric, string, binary, boolean, and datetime.
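The sketch below shows a DataFrame mixing several of those column types and the schema Spark infers for it; the row values and column names are made up for the example:

```python
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-types-sketch").getOrCreate()

# Hypothetical rows mixing numeric, string, boolean, and datetime columns.
df = spark.createDataFrame(
    [(1, "a", True, datetime(2024, 1, 1)),
     (2, "b", False, datetime(2024, 6, 1))],
    ["id", "name", "flag", "ts"],
)

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- flag: boolean (nullable = true)
#  |-- ts: timestamp (nullable = true)
```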