In Chapters 4 and 5, we covered Spark SQL and the DataFrame API: how to connect to built-in and external data sources; interoperability between SQL and DataFrames; creating and managing views and tables; advanced DataFrame and SQL transformations; and a peek into the Spark SQL engine.
Although we briefly introduced Datasets as strongly-typed immutable collections in Chapter 3, we skimmed over some salient aspects of how Datasets are created, stored, and serialized and deserialized in Spark.
In this chapter, we go under the hood to understand Datasets: how to work with Datasets in Java and Scala, how Spark manages memory to accommodate Dataset constructs as part of the unified high-level API, and the costs associated with using Datasets.
As you may recall from Chapter 3 (Figure 3-5 and Table 3-6), Datasets offer a unified, singular API for strongly typed objects in Scala and Java. Because strongly typed objects are a feature of the Java Virtual Machine (JVM), Datasets are unique to Scala and Java among the language APIs supported in Spark; they are not available in Python or R.
More importantly, they are domain-specific typed objects that can be operated on in parallel using either functional programming constructs or the relational domain-specific language (DSL) operators we have become so familiar with from the DataFrame API.
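To make the two styles concrete, here is a minimal sketch in Scala. The case class, field names, and data are hypothetical illustrations (not from the text); it assumes a local SparkSession and shows the same Dataset queried both with a typed lambda and with the relational DSL operators:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetSketch {
  // Hypothetical domain object; the fields are illustrative only
  case class Usage(uid: Int, uname: String, usage: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("DatasetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A strongly typed Dataset[Usage], not an untyped DataFrame
    val ds: Dataset[Usage] =
      Seq(Usage(0, "user-1", 453), Usage(1, "user-2", 128)).toDS()

    // Functional style: a typed lambda operating on Usage objects
    ds.filter(u => u.usage > 200).show()

    // Relational DSL style, identical to the DataFrame API
    ds.where($"usage" > 200).select("uname").show()

    spark.stop()
  }
}
```

Both queries return the same rows; the functional form is checked at compile time against the `Usage` type, while the DSL form resolves column names at analysis time.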
This singular API now ensures that Java developers no longer lag behind the Scala API interface since both ...