Learning Spark, 2nd Edition by Tathagata Das, Brooke Wenig, Denny Lee, Jules Damji

Chapter 6. Loading and Saving Your Data

In Chapters 4 and 5, we discussed internal (built-in) and external data sources and how Spark reads from and writes to them. Although Chapter 4 explored the internal data sources (see the section “Data Sources for DataFrames and SQL Tables”), it did not cover how to organize data when writing it to disk.

In this chapter, we discuss strategies for organizing data for storage: bucketing and partitioning, compression schemes, splittable and non-splittable files, and Parquet files.

Both engineers and data scientists will find parts of this chapter useful when evaluating which storage format best suits the downstream Spark jobs that will consume the saved data.
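To make the idea of partitioning data for storage more concrete before the details, here is a minimal sketch in plain Python. It is not Spark's API. It only mimics the Hive-style `column=value` directory layout that partitioned writes produce, and it compresses each file with gzip. The helper name `write_partitioned`, the file name `part-00000.csv.gz`, and the sample rows are all assumptions made up for illustration.

```python
import csv
import gzip
import os
import tempfile
from collections import defaultdict


def write_partitioned(rows, partition_col, base_dir):
    """Write dict rows as gzip-compressed CSV, one directory per partition value.

    Hypothetical helper for illustration only: it imitates the Hive-style
    `column=value` directory layout, not Spark's actual writer.
    """
    # Group rows by the value of the partition column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)

    written = []
    for value, group in groups.items():
        part_dir = os.path.join(base_dir, f"{partition_col}={value}")
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-00000.csv.gz")
        # The partition value lives in the directory name, so the column
        # itself is left out of the file contents.
        fields = [k for k in group[0] if k != partition_col]
        with gzip.open(path, "wt", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
        written.append(path)
    return sorted(written)


# Made-up sample data for the sketch.
sample_rows = [
    {"country": "US", "city": "Seattle", "delay": 12},
    {"country": "US", "city": "Austin", "delay": 3},
    {"country": "IN", "city": "Pune", "delay": 7},
]

base = tempfile.mkdtemp()
paths = write_partitioned(sample_rows, "country", base)
for p in paths:
    # Show each file relative to the output directory.
    print(os.path.relpath(p, base))
```

A reader that only needs rows for one country can open a single directory and skip the rest, which is the main appeal of partitioning by a frequently filtered column. The gzip choice here also hints at a trade-off: a gzip-compressed text file generally cannot be split across parallel readers, so the choice of compression scheme affects how well later jobs parallelize.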

Motivation for Data Sources

Spark’s ability to interact with many data sources, both internal and external, extends its functionality to the larger Hadoop ecosystem. Built upon the low-level InputFormat and OutputFormat interfaces used by Hadoop MapReduce, Spark’s high-level Data Source V2 APIs can connect to these data sources. For example, Spark can access storage systems such as Amazon S3, Azure Blob Storage, and HDFS, as well as NoSQL stores, giving developers the flexibility to draw on many data sources for analytics.

Spark supports three sets of data sources:

  • File formats and filesystems

  • Structured data sources

  • Databases and NoSQL (key/value stores)

We covered some structured data sources in Chapter 4 and common databases and key/value data sources ...
