3. The majestic role of the dataframe

This chapter covers

  • Using the dataframe
  • The essential (majestic) role of the dataframe in Spark
  • Understanding data immutability
  • Quickly debugging a dataframe’s schema
  • Understanding the lower-level storage in RDDs

In this chapter, you will learn how to use the dataframe. The dataframe is central to a Spark application because it holds typed data, described by a schema, and offers a powerful API.
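To make "typed data through a schema" and immutability a little more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the names `Field`, `TinyFrame`, `with_column`, and `schema_string`, along with the sample rows, are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Field:
    """One schema column: a name and the Python type its values must have."""
    name: str
    dtype: type


@dataclass(frozen=True)
class TinyFrame:
    """Toy stand-in for a dataframe (NOT Spark): a schema plus conforming rows."""
    schema: Tuple[Field, ...]
    rows: Tuple[Tuple[Any, ...], ...]

    def __post_init__(self) -> None:
        # Reject any row whose shape or value types disagree with the schema.
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError(f"row {row!r} does not match the schema width")
            for field, value in zip(self.schema, row):
                if not isinstance(value, field.dtype):
                    raise TypeError(
                        f"column {field.name!r} expects {field.dtype.__name__}"
                    )

    def with_column(
        self, name: str, dtype: type, fn: Callable[[Dict[str, Any]], Any]
    ) -> "TinyFrame":
        """Return a NEW frame with one extra column; this frame is untouched."""
        names = [f.name for f in self.schema]
        new_rows = tuple(row + (fn(dict(zip(names, row))),) for row in self.rows)
        return TinyFrame(self.schema + (Field(name, dtype),), new_rows)

    def schema_string(self) -> str:
        """Render the schema as text, handy for quick debugging."""
        return "\n".join(f"{f.name}: {f.dtype.__name__}" for f in self.schema)


# Hypothetical sample data.
people = TinyFrame(
    schema=(Field("name", str), Field("age", int)),
    rows=(("Ada", 36), ("Linus", 28)),
)

# A transformation yields a new frame rather than modifying `people`.
older = people.with_column("age_next_year", int, lambda r: r["age"] + 1)
print(older.schema_string())
```

The frozen dataclasses mean a frame cannot be reassigned in place, so every transformation produces a new object. That is the property this sketch is meant to mirror, under the assumptions stated above.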

As you saw in previous chapters, Spark is a marvelous distributed analytics engine. Wikipedia defines an operating system (OS) as “system software that manages computer hardware [and] software resources, and provides common services for computer programs.” In chapter 1, I even qualify Spark ...
