Structure permeates our society and systematically orders our daily lives. Similarly, structured (or organized) data allows us to accomplish simple or complex tasks in a systematic manner.
structure:
1. The action or process of building or constructing.
2. To put together systematically; to arrange according to a plan; to give a pattern or organization to.
In this chapter, we will explore the principal motivations behind adding structure to Apache Spark, how structure led to the creation of the high-level DataFrame and Dataset APIs and their unification across Spark's components in Spark 2.x, and the Spark SQL engine that underpins these structured high-level APIs.
We got our first glimpse of structure in Spark when Spark SQL was introduced in the early Spark 1.x releases, followed in Spark 1.3 by DataFrames as the successor to SchemaRDDs. Spark SQL introduced expressive, high-level operations that mimic SQL syntax, and DataFrames provided spreadsheet-like named columns with data types dictated by a schema. DataFrames laid the foundation for more structure in subsequent releases and paved the path to performant operations in Spark's computational queries.
But before we talk about structure in Spark, let’s get a brief glimpse of what it means to not have structure in Spark by peeking into the simple ...