Fast Data Processing with Spark 2 - Third Edition by Krishna Sankar

Data modalities and Datasets/DataFrames/RDDs

Now let's tie the data modalities to the Spark abstractions and see how we can read and write data. Before 2.0.0, things were conceptually simpler: we only needed to read data into RDDs and use map() to transform it as required. However, data wrangling was harder. With Datasets/DataFrames, we can read data directly into a table with column headings, associate data types with domain semantics, and start working with the data more effectively.
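To make the contrast in the paragraph above concrete, here is a minimal PySpark sketch of both styles. It is illustrative only: it assumes an existing SparkSession passed in as `spark` and a hypothetical two-column CSV file with a `name,age` header row; the helper names are made up for this example.

```python
def parse_line(line):
    """Turn one raw CSV line into a (name, age) tuple with a typed age."""
    name, age = line.split(",")
    return name.strip(), int(age)


def read_as_rdd(spark, path):
    """RDD style: read raw lines, drop the header, parse each line with map()."""
    # Assumption: `spark` is a SparkSession; its SparkContext does the reading.
    lines = spark.sparkContext.textFile(path)
    header = lines.first()
    return lines.filter(lambda line: line != header).map(parse_line)


def read_as_dataframe(spark, path):
    """DataFrame style: let Spark use the header row and infer column types."""
    return spark.read.csv(path, header=True, inferSchema=True)
```

With a running session, `read_as_rdd(spark, "people.csv")` would yield plain tuples whose meaning lives only in `parse_line`, whereas `read_as_dataframe(spark, "people.csv").printSchema()` would report named, typed columns without any hand-written parsing.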

As a general rule of thumb, follow these guidelines:

  1. Use SparkContext and RDDs to handle unstructured data.
  2. Use SparkSession and Datasets/DataFrames for semi-structured and structured data. As you will see in the later chapters, SparkSession has unified ...
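The rule of thumb above could be sketched as a small dispatch helper. This is an illustration under assumptions, not a prescribed pattern: the set of formats treated as structured or semi-structured, and the function names, are choices made for this example, and `spark` is assumed to be an existing SparkSession.

```python
# Assumption: these formats count as structured or semi-structured here.
STRUCTURED_FORMATS = {"csv", "json", "parquet"}


def entry_point_for(fmt):
    """Name the entry point the rule of thumb suggests for a given format."""
    if fmt.lower() in STRUCTURED_FORMATS:
        return "SparkSession"
    return "SparkContext"


def load(spark, path, fmt):
    """Load `path` as a DataFrame or an RDD of lines, following the rule of thumb."""
    if entry_point_for(fmt) == "SparkSession":
        # Structured or semi-structured data: use the DataFrame reader.
        # Format-specific options (such as a CSV header flag) are omitted here.
        return spark.read.format(fmt.lower()).load(path)
    # Unstructured data: fall back to an RDD of raw text lines.
    return spark.sparkContext.textFile(path)
```

For example, `load(spark, "events.json", "json")` would return a DataFrame, while `load(spark, "server.log", "log")` would return an RDD of strings; both file names are hypothetical.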
