Dataset APIs - an overview

Before we get into Datasets and data wrangling, let's take a broader view of the APIs, focusing on the functions we need. This will give us a firm foundation when we wrangle data later in this chapter. Refer to the following diagram:

[Diagram: Dataset APIs - an overview]

The preceding diagram shows the broader hierarchy of the org.apache.spark.sql classes. Interestingly, pyspark.sql mirrors this hierarchy, except that it exposes DataFrame, which is essentially the Scala Dataset (in Scala, DataFrame is an alias for Dataset[Row]). What I like about the PySpark interface is that it is succinct and crisp while offering the same power, performance, and functionality as Scala or Java. But Scala has more ...
