Spark for Data Science by Bikramaditya Singhal, Srinivas Duvvuri

Datasets

Apache Spark Datasets extend the DataFrame API with a type-safe, object-oriented programming interface. The API was first introduced in the 1.6 release, and Spark 2.0 unified the DataFrame and Dataset APIs. A DataFrame is now a generic, untyped Dataset; put the other way around, a Dataset is a DataFrame with added structure. "Structure" here refers to a pattern or organization of the underlying data, much like a table schema in RDBMS parlance. The structure limits what can be expressed or contained in the underlying data, which in turn enables better optimizations in memory organization as well as physical execution. Compile-time type checking leads to catching errors earlier than during ...
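The difference between the typed and untyped views can be illustrated with a minimal Scala sketch. It assumes Spark 2.x or later on the classpath and a local master; the `Person` case class, the sample rows, the object name `DatasetSketch`, and both helper methods are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type; it supplies the "structure" of the Dataset.
case class Person(name: String, age: Int)

object DatasetSketch {

  // Typed route: the compiler knows every element is a Person.
  def adultNames(spark: SparkSession): List[String] = {
    import spark.implicits._ // encoders for case classes, plus toDS()
    val people: Dataset[Person] =
      Seq(Person("Ann", 34), Person("Bob", 19)).toDS()
    // A typo such as p.agee would be rejected at compile time.
    val adults: Dataset[Person] = people.filter((p: Person) => p.age >= 21)
    adults.collect().map(_.name).toList
  }

  // Untyped route: a DataFrame is a Dataset[Row], columns are named by string.
  def adultNamesUntyped(spark: SparkSession): List[String] = {
    import spark.implicits._ // also enables the $"column" syntax
    val df: DataFrame = Seq(Person("Ann", 34), Person("Bob", 19)).toDF()
    // A misspelled column name here would only fail when the program runs.
    val adultsDf: DataFrame = df.filter($"age" >= 21)
    // Re-attach the structure to get a typed Dataset back.
    val typedAgain: Dataset[Person] = adultsDf.as[Person]
    typedAgain.collect().map(_.name).toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetSketch")
      .master("local[*]") // assumption: run locally for illustration
      .getOrCreate()
    println(adultNames(spark))
    println(adultNamesUntyped(spark))
    spark.stop()
  }
}
```

Both methods select the same rows; the difference is when a mistake is detected. In the typed version a wrong field name is a compile error, while in the untyped version it surfaces only once the query is analyzed at run time.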
