O'Reilly logo

Spark: The Definitive Guide by Matei Zaharia, Bill Chambers

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 5. Basic Structured Operations

In Chapter 4, we introduced the core abstractions of the Structured API. This chapter moves away from the architectural concepts and toward the tactical tools you will use to manipulate DataFrames and the data within them. This chapter focuses exclusively on fundamental DataFrame operations and avoids aggregations, window functions, and joins. These are discussed in subsequent chapters.

Definitionally, a DataFrame consists of a series of records (like rows in a table), that are of type Row, and a number of columns (like columns in a spreadsheet) that represent a computation expression that can be performed on each individual record in the Dataset. Schemas define the name as well as the type of data in each column. Partitioning of the DataFrame defines the layout of the DataFrame or Dataset’s physical distribution across the cluster. The partitioning scheme defines how that is allocated. You can set this to be based on values in a certain column or nondeterministically.

Let’s create a DataFrame with which we can work:

// in Scala
val df = spark.read.format("json")
  .load("/data/flight-data/json/2015-summary.json")
# in Python
df = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")

We discussed that a DataFame will have columns, and we use a schema to define them. Let’s take a look at the schema on our current DataFrame:

df.printSchema()

Schemas tie everything together, so they’re worth belaboring.

Schemas

A schema defines ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required