Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua

DataFrames

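A Spark ML pipeline is defined as a sequence of stages, where each stage is either a transformer or an estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage. For a transformer stage, transform() is called on the DataFrame; for an estimator stage, fit() is called to produce a transformer, which is then applied to the DataFrame.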

A DataFrame is the basic data structure that flows through the pipeline. It is represented as a Dataset of rows organized into named columns, and its columns support many types, such as numeric, string, binary, boolean, and datetime.
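
The flow described above can be illustrated with a small plain-Python sketch. It is a toy model of the idea, not Spark's API: the "DataFrame" is just a list of dictionaries (rows with named columns), and the names Lowercase, MeanLength, MeanLengthModel, and fit_pipeline are hypothetical, invented for this illustration. In Spark itself, the corresponding pieces are the Pipeline, Transformer, and Estimator classes in the ML library.

```python
# Toy model of the pipeline idea. NOT Spark's API: every name here is hypothetical.
# A "DataFrame" is stood in for by a list of dicts (rows with named columns).


class Lowercase:
    """Transformer: turns one DataFrame into another; nothing to learn."""

    def transform(self, df):
        return [{**row, "text": row["text"].lower()} for row in df]


class MeanLengthModel:
    """Transformer produced by fitting the MeanLength estimator."""

    def __init__(self, mean_len):
        self.mean_len = mean_len

    def transform(self, df):
        # Add a boolean column; existing columns are carried along unchanged.
        return [{**row, "is_long": len(row["text"]) > self.mean_len} for row in df]


class MeanLength:
    """Estimator: learns a value from the data in fit() and returns a transformer."""

    def fit(self, df):
        mean_len = sum(len(row["text"]) for row in df) / len(df)
        return MeanLengthModel(mean_len)


def fit_pipeline(stages, df):
    """Run the stages in order, passing the transformed data to the next stage."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):      # estimator: fit it to get a transformer
            stage = stage.fit(df)
        df = stage.transform(df)       # transformer: produce the next DataFrame
        fitted.append(stage)
    return fitted, df


rows = [{"text": "Spark"}, {"text": "Machine Learning"}, {"text": "ML"}]
fitted_stages, result = fit_pipeline([Lowercase(), MeanLength()], rows)
for row in result:
    print(row)
```

Each stage returns new rows instead of modifying its input, which loosely mirrors the way a pipeline stage produces a new DataFrame from the one it receives. The list of fitted stages contains only transformers, so it can be reused on new data without fitting again.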
