Datasets

Datasets are strongly typed collections of objects. These objects are usually domain-specific and can be transformed in parallel using relational or functional operations.

These operations fall into two categories: transformations and actions. Transformations are lazily evaluated functions that produce new datasets, while actions trigger the computation and either return a result to the driver or write it to storage. Transformations include map, flatMap, filter, select, and aggregate; actions include count, show, and saving to a filesystem.
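The distinction between transformations and actions can be illustrated with a minimal sketch. It assumes Spark is on the classpath and runs in local mode; the `Person` case class, the sample rows, and the `adults` helper are hypothetical names introduced only for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain type, used only for this illustration.
case class Person(name: String, age: Int)

object TransformationsAndActions {
  // A transformation: returns a new Dataset and does not run a job by itself.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(_.age >= 18)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Transformations and actions")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()
    import spark.implicits._ // encoders for case classes and primitives

    val people = Seq(Person("Ana", 34), Person("Ben", 17), Person("Chen", 52)).toDS()

    // Chained transformations: still nothing has been computed.
    val names = adults(people).map(_.name)

    // Actions: each call triggers a computation.
    println(names.count()) // two of the three sample people are 18 or older
    names.show()

    spark.stop()
  }
}
```

Keeping the filter in a small function that takes and returns a `Dataset[Person]` is one way to make the typed, functional style visible; the same filter could also be written relationally with column expressions.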

The following instructions will help you create a dataset from a CSV file:

  1. Initialize SparkSession:
     //Scala
     import org.apache.spark.sql.SparkSession
     val spark = SparkSession.builder()
       .appName("Spark DataSet example")
       .config("spark.config.option", ...
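Since the excerpt ends at the first step, the overall flow from a CSV file to a typed dataset might look roughly like the sketch below. This is not the book's own listing: the `Employee` case class, the `readEmployees` helper, the file name `employees.csv`, and the local master setting are all assumptions made for illustration, and the elided configuration option above is left out.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical record type; field order and types must match the CSV columns.
case class Employee(name: String, age: Int)

object CsvToDataset {
  // Reads a CSV file with a header row and converts it to a typed Dataset.
  def readEmployees(spark: SparkSession, path: String): Dataset[Employee] = {
    import spark.implicits._
    spark.read
      .option("header", "true")                 // first line holds column names
      .schema(Encoders.product[Employee].schema) // derive the schema from the case class
      .csv(path)
      .as[Employee]                              // DataFrame -> Dataset[Employee]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Spark DataSet example")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()

    val employees = readEmployees(spark, "employees.csv") // hypothetical path
    employees.show()

    spark.stop()
  }
}
```

Deriving the schema from the case class avoids relying on schema inference, which could pick a column type that does not match the case class fields.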
