
Apache Spark 2.x Machine Learning Cookbook by Shuen Mei, Broderick Hall, Meenakshi Rajendran, Siamak Amirghodsi


Dataset - a high-level unifying Data API

A Dataset is an immutable collection of objects that are modelled on (mapped to) a traditional relational schema. Four attributes distinguish it as the preferred method going forward. We find the Dataset API particularly appealing because it feels familiar to RDD users, offering the usual transformation operators (for example, filter(), map(), flatMap(), and so on). Like an RDD, a Dataset follows a lazy execution paradigm. The best way to reconcile DataFrames and Datasets is to think of DataFrame as an alias for Dataset[Row].

  • Strong type safety: We now have both compile-time (syntax errors) and runtime safety in a unified Data API, which helps the ML developer ...
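
The ideas above (typed transformations, lazy execution, and DataFrame as an alias for Dataset[Row]) can be sketched in a few lines of Scala. This is a minimal illustration rather than the book's own recipe: the Person case class, the sample records, the adultNameTokens helper, and the local[*] master setting are all assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, introduced only for this illustration.
// Defined at top level so Spark can derive an Encoder for it.
case class Person(name: String, age: Int)

object DatasetSketch {

  // Hypothetical helper: chains typed, lazy transformations.
  // Nothing is executed here; Spark only records a plan.
  def adultNameTokens(spark: SparkSession, people: Dataset[Person]): Dataset[String] = {
    import spark.implicits._      // brings the Encoder[String] into scope
    people
      .filter(_.age >= 21)        // typed lambda, checked by the compiler
      .map(_.name)                // Dataset[Person] => Dataset[String]
      .flatMap(_.split(" "))      // one output element per word in the name
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")         // assumption: run locally for the demo
      .appName("DatasetSketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val people: Dataset[Person] =
      Seq(Person("Ann Lee", 34), Person("Bob Ray", 19), Person("Cy Wu", 52)).toDS()

    val tokens = adultNameTokens(spark, people) // still only a plan
    tokens.explain()                            // prints the plan, runs no job
    tokens.collect().foreach(println)           // action: triggers execution

    // DataFrame is a type alias for Dataset[Row], so this assignment compiles.
    val df: DataFrame = people.toDF()
    val rows: Dataset[Row] = df

    // Returning to the typed view needs an Encoder[Person] from spark.implicits._
    val typedAgain: Dataset[Person] = rows.as[Person]
    typedAgain.show()

    // people.map(_.salary)  // would not compile: Person has no field `salary`
    // df.select("salary")   // compiles, but fails at runtime (AnalysisException)

    spark.stop()
  }
}
```

The two commented-out lines at the end show the practical difference: a mistake in a typed Dataset expression is rejected by the compiler, while the same mistake against an untyped DataFrame column name surfaces only when the query is analyzed at runtime.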
