O'Reilly logo

Fast Data Processing with Spark 2 - Third Edition by Krishna Sankar

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 9. Foundations of Datasets/DataFrames – The Proverbial Workhorse for DataScientists

From a data wrangling perspective, Datasets are the most important feature of Spark 2.0.0. In this chapter, we will first look at Datasets from a stack perspective, including layering, optimizations, and so forth. Then we will delve more deeply into the actual Dataset APIs and cover the various operations, starting from reading various formats to creating Datasets and finally covering the rich capabilities for queries, aggregations, and scientific operations. We will use the car and orders Datasets for our examples.

Datasets - a quick introduction

A Spark Dataset is a group of specified heterogeneous columns, akin to a spreadsheet or a relational database ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required