© Ed Elliott 2021
E. Elliott, Introducing .NET for Apache Spark, https://doi.org/10.1007/978-1-4842-6992-3_5

5. The DataFrame API

Ed Elliott, Sussex, UK

In this chapter, we will look at the DataFrame API, the core API that we will use with .NET for Apache Spark. Apache Spark offers two APIs for processing data: the Resilient Distributed Dataset (RDD) API and the DataFrame API. We will cover what each API is, why the RDD API is not available in .NET, and why that is fine: the DataFrame API gives us everything we need.

The RDD API vs. the DataFrame API

The Resilient Distributed Dataset (RDD) API provides access to RDDs. RDDs are an abstraction over what could be massive data files by partitioning the files and spreading the ...
