Datasets are the foundational type of the Structured APIs. We already worked with DataFrames, which are Datasets of type Row and are available across Spark’s different languages. Datasets are a strictly Java Virtual Machine (JVM) language feature that works only with Scala and Java. Using Datasets, you can define the object that each row in your Dataset will consist of. In Scala, this will be a case class object that essentially defines a schema you can use, and in Java, you will define a Java Bean. Experienced users often refer to Datasets as the “typed set of APIs” in Spark. For more information, see Chapter 4.
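As a concrete sketch of the Scala side, the following defines a hypothetical case class and builds a Dataset from it (assuming a SparkSession named `spark`; the class and field names are illustrative, not from the text):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class; its fields define both the schema
// and the JVM type of each row in the Dataset.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
import spark.implicits._  // brings encoders for case classes into scope

// Each element of this Dataset is a Person, not a generic Row.
val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// Operations can use the typed fields directly, with compile-time checking.
people.filter(p => p.age > 30).show()
```

Because `people` is typed, referring to a nonexistent field like `p.height` fails at compile time rather than at runtime, which is the core benefit the “typed set of APIs” provides over DataFrames.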
In Chapter 4, we discussed that Spark has its own types, such as StructType. Those Spark-specific types map to types available in each of Spark’s languages, such as Double. When you use the DataFrame API, you do not create strings or integers yourself; Spark manipulates the data for you through the Row object. In fact, if you use Scala or Java, all “DataFrames” are actually Datasets of type
Row. To efficiently support domain-specific objects, a special concept called an “Encoder” is required. The encoder maps the domain-specific type T to Spark’s internal type system.
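In Scala this equivalence is literal: `DataFrame` is defined as a type alias for `Dataset[Row]`, as this minimal sketch shows (assuming a SparkSession named `spark`):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("row-sketch").getOrCreate()

// In the Scala API, type DataFrame = Dataset[Row],
// so the two declarations below describe the same value.
val df: DataFrame = spark.range(3).toDF("value")
val rows: Dataset[Row] = df  // compiles because the types are identical
```

Row is a generic, untyped container, which is why moving to a domain-specific type T requires the Encoder described next.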
For example, given a class
Person with two fields,
name (string) and
age (int), an encoder directs Spark to generate code at runtime to serialize the
Person object into a binary structure. When using DataFrames or the “standard” Structured ...