What is the DataFrame API?
Before looking at what the DataFrame API is, we should review what an RDD is and identify what could be improved in the RDD interface. The RDD has been the user-facing API in Apache Spark since its inception and, as discussed earlier, can represent unstructured data, is compile-time safe, has dependencies, is evaluated lazily, and represents a distributed collection of data across a Spark cluster. RDDs are split into partitions, and each partition can carry locality information, which helps the Spark scheduler run the computation on the machines where the data already resides and so reduces costly network overhead.
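The properties listed above (partitions, dependencies, and lazy evaluation) can be illustrated with a small toy model. This is a sketch in plain Python, not Spark's implementation or API: the class `ToyRDD` and its methods are hypothetical names chosen to echo the RDD vocabulary, and the round-robin partitioning is an assumption made only to keep the example short.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions, a parent dependency, lazy evaluation."""

    def __init__(self, partitions=None, parent=None, transform=None):
        self._partitions = partitions  # source data; only set on the root dataset
        self.parent = parent           # dependency (lineage) on the parent dataset
        self._transform = transform    # per-partition function, recorded but not yet run

    @classmethod
    def parallelize(cls, data, num_partitions):
        items = list(data)
        # Assumption: a simple round-robin split into num_partitions chunks.
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return cls(partitions=parts)

    def map(self, f):
        # A transformation only records what to do; nothing is computed here.
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def num_partitions(self):
        if self.parent is None:
            return len(self._partitions)
        return self.parent.num_partitions()

    def _compute(self, index):
        # Walk the lineage back to the root, then apply each recorded transform.
        if self.parent is None:
            return self._partitions[index]
        return self._transform(self.parent._compute(index))

    def collect(self):
        # An action: only now is each partition actually computed.
        out = []
        for i in range(self.num_partitions()):
            out.extend(self._compute(i))
        return out


calls = []  # records every invocation of square, to make laziness visible


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD.parallelize(range(8), num_partitions=4)
evens = rdd.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)  # still 0: transformations were only recorded
result = sorted(evens.collect())  # the action triggers the computation
```

In this model, `map` and `filter` build a chain of parent references without touching the data, and `collect` processes each partition independently. A real scheduler could use that per-partition independence to place each task on the machine that already holds the partition.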
However, from a programming perspective, the computation itself is less transparent, as ...