Chapter 5 presented basic DataFrame concepts and abstractions. This chapter covers building expressions, which are the bread and butter of Spark’s structured operations. We also review how to work with several kinds of data, including the following:
Dates and timestamps
Before we begin, it’s worth explaining where you as a user should look for transformations. Spark is a growing project, and any book (including this one) is a snapshot in time. One of our priorities in this book is to teach where, as of this writing, you should look to find functions to transform your data. Following are the key places to look:
The first place is the DataFrame methods. This is a bit of a trick, because a DataFrame is just a Dataset of Row types, so you’ll end up looking at the Dataset methods, which are listed in the Dataset API documentation.
Dataset submodules like DataFrameStatFunctions and DataFrameNaFunctions have more methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a variety of statistically related functions, whereas DataFrameNaFunctions holds functions that are relevant when working with null data.
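To make these lookup locations concrete, here is a minimal Scala sketch, assuming a local SparkSession and a small made-up dataset. The object name, the application name, and the columns id, qty, and price are illustrative and do not come from the text. It shows that a DataFrame can be assigned to a Dataset[Row] without any conversion, and that the null-handling and statistical helpers are reached through the na and stat accessors on the Dataset:

```scala
import org.apache.spark.sql.{DataFrame, DataFrameNaFunctions, DataFrameStatFunctions, Dataset, Row, SparkSession}

object WhereToLook {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are arbitrary choices.
    val spark = SparkSession.builder()
      .appName("where-to-look")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the second row has a null in "qty".
    val df: DataFrame = Seq[(String, Option[Double], Double)](
      ("a", Some(1.0), 10.0),
      ("b", None, 20.0),
      ("c", Some(3.0), 30.0)
    ).toDF("id", "qty", "price")

    // In Scala, DataFrame is a type alias for Dataset[Row],
    // so this assignment compiles without a cast.
    val ds: Dataset[Row] = df

    // df.na exposes DataFrameNaFunctions: methods for working with null data.
    val naFunctions: DataFrameNaFunctions = ds.na
    val filled: DataFrame = naFunctions.fill(0.0, Seq("qty"))
    filled.show()

    // df.stat exposes DataFrameStatFunctions: statistically related methods.
    val statFunctions: DataFrameStatFunctions = filled.stat
    val correlation: Double = statFunctions.corr("qty", "price")
    println(s"Correlation between qty and price: $correlation")

    spark.stop()
  }
}
```

Because both accessors hang off the Dataset itself, the Dataset API documentation is a reasonable starting point even when the method you need lives in one of these submodules.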