Now let's tie the modalities together with the Spark abstractions and see how we can read and write data. Before 2.0.0, things were conceptually simpler: we only needed to read data into RDDs and use
map() to transform the data as required. However, data wrangling was harder. With Datasets/DataFrames, we can read data directly into a table with headings, associate data types with domain semantics, and start working with the data more effectively.
As a general rule of thumb, perform the following steps:
- Use SparkContext and RDDs to handle unstructured data.
- Use SparkSession and Datasets/DataFrames for semi-structured and structured data. As you will see in the later chapters, SparkSession has unified ...