Chapter 9. Data Sources
This chapter formally introduces the data sources that you can use with Spark out of the box, as well as the many others built by the greater community. Spark has six “core” data sources and hundreds of external data sources written by the community. The ability to read from and write to many different kinds of data sources, and for the community to contribute its own, is arguably one of Spark’s greatest strengths. Following are Spark’s core data sources:
- CSV
- JSON
- Parquet
- ORC
- JDBC/ODBC connections
- Plain-text files
As mentioned, Spark has numerous community-created data sources. Here’s just a small sample:
- And many, many others
The goal of this chapter is to give you the ability to read from and write to Spark’s core data sources, and to know what to look for when integrating with third-party data sources. To achieve this, we will focus on the core concepts that you need to be able to recognize and understand.
The Structure of the Data Sources API
Before proceeding with how to read and write from certain formats, let’s visit the overall organizational structure of the data source APIs.
Read API Structure
The core structure for reading data is as follows:
DataFrameReader.format(...).option("key", "value").schema(...).load()
We will use this format to read from all of our data sources. format is optional because by default Spark will ...
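As a minimal sketch of the read structure shown above, the chain can be written in PySpark roughly as follows. The function name, the column names in the schema string, and the file path are hypothetical and chosen only for illustration. The sketch assumes `spark` is an existing SparkSession, whose `read` attribute returns a DataFrameReader.

```python
def read_with_explicit_schema(spark, path):
    """Read a CSV file using the format/option/schema/load chain.

    `spark` is assumed to be an existing SparkSession; `path` is a
    hypothetical file location supplied by the caller.
    """
    return (
        spark.read                       # spark.read returns a DataFrameReader
        .format("csv")                   # which data source to read from
        .option("header", "true")        # a key/value setting for that source
        .schema("name STRING, age INT")  # DDL-style schema string (assumed columns)
        .load(path)                      # returns a DataFrame for the given path
    )


def main():
    # Imported here so the function above can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("read-api-sketch")
        .getOrCreate()
    )
    df = read_with_explicit_schema(spark, "/tmp/people.csv")  # hypothetical path
    df.printSchema()
    spark.stop()
```

Calling `main()` runs the sketch against a local Spark session. Each call in the chain returns the reader itself until `load`, which is why the steps can be written as one expression.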