In chapters 4 and 5, we discussed how Spark reads from and writes to internal (built-in) and external data sources. While chapter 4 explored the internal data sources (see the section “Data Sources for DataFrames and SQL Tables”), we did not cover how to organize data when writing it to disk.
In this chapter, we discuss strategies for organizing data in storage: bucketing and partitioning, compression schemes, splittable versus non-splittable files, and the Parquet file format.
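One of these strategies, partitioning, determines the directory layout that Spark produces on disk: one subdirectory per distinct value of the partition column, named in the Hive-style `col=value` form. The following pure-Python sketch (no Spark required; the sample rows and column names are hypothetical, chosen only for illustration) mimics that layout:

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample rows; in Spark these would be a DataFrame.
rows = [
    {"device": "a1", "country": "US", "temp": 21},
    {"device": "b2", "country": "DE", "temp": 18},
    {"device": "c3", "country": "US", "temp": 25},
]

def write_partitioned(rows, base_dir, partition_col):
    """Mimic Hive-style partitioning: one subdirectory per distinct
    value of partition_col, named col=value, holding that value's rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    for value, group in groups.items():
        part_dir = Path(base_dir) / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-00000.csv", "w", newline="") as f:
            # The partition column is encoded in the path, not in the file,
            # which is also how Spark writes partitioned output.
            fields = [k for k in group[0] if k != partition_col]
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            for row in group:
                writer.writerow({k: row[k] for k in fields})

base = tempfile.mkdtemp()
write_partitioned(rows, base, "country")
print(sorted(p.name for p in Path(base).iterdir()))
# → ['country=DE', 'country=US']
```

A query that filters on the partition column can then skip entire subdirectories without opening their files, which is the payoff of this layout that we return to later in the chapter.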
Both engineers and data scientists will find parts of this chapter useful as they evaluate which storage formats are best suited to downstream consumption by future Spark jobs that use the saved data.
Spark’s ability to interact with many data sources—internal and external—extends its functionality to the larger Hadoop ecosystem. Built upon the low-level OutputFormat interfaces used by Hadoop MapReduce, Spark’s high-level Data Source V2 APIs can connect to these data sources. For example, Spark can access storage systems such as S3, Azure Blob Storage, HDFS, and NoSQL stores, giving Spark and its developers immense flexibility to access myriad data sources for data analytics.
Spark supports three sets of data sources:
File formats and filesystems
Structured data sources
Databases and NoSQL (key/value stores)
We covered some structured data sources in chapter 4 and common databases and key/value data sources ...