Dataset - a high-level unifying Data API

A dataset is an immutable collection of objects which are modelled/mapped to a traditional relational schema. There are four attributes that distinguish it as the preferred method going forward. We particularly find the Dataset API appealing since we find it familiar to RDDs with the usual transformational operators (for example, filter(), map(), flatMap(), and so on). The Dataset will follow a lazy execution paradigm similar to RDD. The best way to try to reconcile DataFrames and Datasets is to think of a DataFrame as an alias that can be thought of as Dataset[Row].

  • Strong type safety: We now have both compile-time (syntax errors) and runtime safety in a unified Data API, which helps the ML developer ...

Get Apache Spark 2.x Machine Learning Cookbook now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.