Chapter 6. Working with Different Types of Data
Chapter 5 presented basic DataFrame concepts and abstractions. This chapter covers building expressions, which are the bread and butter of Spark’s structured operations. We also review working with a variety of different kinds of data, including the following:
- Booleans
- Numbers
- Strings
- Dates and timestamps
- Handling null
- Complex types
- User-defined functions
Where to Look for APIs
Before we begin, it’s worth explaining where you as a user should look for transformations. Spark is a growing project, and any book (including this one) is a snapshot in time. One of our priorities in this book is to teach where, as of this writing, you should look to find functions to transform your data. Following are the key places to look:
DataFrame (Dataset) Methods
This is a bit of a trick because a DataFrame is just a Dataset of Row types, so you’ll end up looking at the Dataset methods, which are documented in the Dataset API reference.
Dataset submodules like DataFrameStatFunctions and DataFrameNaFunctions have more methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a variety of statistically related functions, whereas DataFrameNaFunctions refers to functions that are relevant when working with null data.
Column Methods
These were introduced for the most part in Chapter 5. They hold a variety of general column-related methods like alias or contains. You can find the API Reference for Column ...