Chapter 6. Working with Different Types of Data

Chapter 5 presented basic DataFrame concepts and abstractions. This chapter covers building expressions, which are the bread and butter of Spark’s structured operations. We also review working with a variety of different kinds of data, including the following:

  • Booleans

  • Numbers

  • Strings

  • Dates and timestamps

  • Handling null

  • Complex types

  • User-defined functions

Where to Look for APIs

Before we begin, it’s worth explaining where you as a user should look for transformations. Spark is a growing project, and any book (including this one) is a snapshot in time. One of our priorities in this book is to teach where, as of this writing, you should look to find functions to transform your data. Following are the key places to look:

DataFrame (Dataset) Methods

This is actually a bit of a trick because a DataFrame is just a Dataset of Row types, so you’ll actually end up looking at the Dataset methods, which are available at this link.

Dataset submodules like DataFrameStatFunctions and DataFrameNaFunctions have more methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a variety of statistically related functions, whereas DataFrameNaFunctions refers to functions that are relevant when working with null data.

Column Methods

These were introduced for the most part in Chapter 5. They hold a variety of general column-related methods like alias or contains. You can find the API Reference for Column ...

Get Spark: The Definitive Guide now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.