Chapter 6. Working with Different Types of Data
Chapter 5 presented basic DataFrame concepts and abstractions. This chapter covers building expressions, which are the bread and butter of Spark’s structured operations. We also review working with a variety of different kinds of data, including the following:
- Booleans
- Numbers
- Strings
- Dates and timestamps
- Handling null
- Complex types
- User-defined functions
Where to Look for APIs
Before we begin, it’s worth explaining where you as a user should look for transformations. Spark is a growing project, and any book (including this one) is a snapshot in time. One of our priorities in this book is to teach where, as of this writing, you should look to find functions to transform your data. Following are the key places to look:
DataFrame (Dataset) Methods
This is a bit of a trick because a DataFrame is just a Dataset of Row types, so you’ll end up looking at the Dataset methods, which are listed in the Dataset API reference.
Dataset submodules like DataFrameStatFunctions and DataFrameNaFunctions have more methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a variety of statistically related functions, whereas DataFrameNaFunctions refers to functions that are relevant when working with null data.
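As a rough illustration of how these two submodules are reached in practice, the sketch below uses the na and stat accessors on a DataFrame, which return DataFrameNaFunctions and DataFrameStatFunctions respectively. The local SparkSession, the object name, and the toy columns x and y are assumptions made for this example, not something taken from the chapter.

```scala
import org.apache.spark.sql.SparkSession

object StatAndNaSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stat-and-na-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: column y contains one null (None).
    val df = Seq(
      (Some(1.0), Some(2.0)),
      (Some(2.0), None),
      (Some(3.0), Some(6.0))
    ).toDF("x", "y")

    // df.na exposes DataFrameNaFunctions: null-related helpers.
    val filled = df.na.fill(0.0, Seq("y")) // replace nulls in y with 0.0
    val dropped = df.na.drop()             // drop rows containing any null

    // df.stat exposes DataFrameStatFunctions: statistical helpers.
    val correlation = dropped.stat.corr("x", "y")

    filled.show()
    println(s"rows without nulls: ${dropped.count()}, corr(x, y): $correlation")

    spark.stop()
  }
}
```

Both accessors live on the Dataset itself, so when a null-handling or statistical method does not appear in the main Dataset method list, these two classes are the next place to check.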
Column Methods
These were introduced for the most part in Chapter 5. They hold a variety of general column-related methods like alias or contains. You can find the API Reference for Column ...
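To make the two Column methods named above concrete, here is a minimal sketch that combines contains and alias in a single select. The local SparkSession, the object name, the description column, and its sample values are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ColumnMethodsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-methods-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: a single string column.
    val df = Seq("white mug", "red lamp", "white bowl").toDF("description")

    // contains builds a Boolean column expression; alias names the result.
    val flagged = df.select(
      col("description"),
      col("description").contains("white").alias("is_white")
    )

    flagged.show()
    spark.stop()
  }
}
```

Both calls are methods on Column rather than on the DataFrame, which is why they are documented in the Column API reference and not alongside the Dataset methods.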