This chapter formally introduces the variety of data sources that you can use with Spark out of the box, as well as the countless other sources built by the greater community. Spark has six "core" data sources and hundreds of external data sources written by the community. The ability to read and write from all different kinds of data sources, and for the community to create its own contributions, is arguably one of Spark's greatest strengths. Following are Spark's core data sources:

- CSV
- JSON
- Parquet
- ORC
- JDBC/ODBC connections
- Plain-text files
As mentioned, Spark has numerous community-created data sources. Here's just a small sample:

- Cassandra
- HBase
- MongoDB
- AWS Redshift
- XML
- And many, many others
The goal of this chapter is to give you the ability to read from and write to Spark's core data sources, and to know enough about what to look for when integrating with third-party data sources. To achieve this, we will focus on the core concepts that you need to be able to recognize and understand.
Before proceeding with how to read and write from certain formats, let’s visit the overall organizational structure of the data source APIs.
The core structure for reading data is as follows:
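In general terms, all reads go through a DataFrameReader, which you access via the SparkSession's read attribute. The pattern looks like this (the ellipses stand for format-specific values, and the schema and option calls are optional depending on the source):

```
DataFrameReader.format(...).option("key", "value").schema(...).load()
```

You specify the format, any number of key-value options, optionally a schema, and finally call load (with or without a path) to produce a DataFrame.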
We will use this format to read from all of our data sources.
format is optional because by default Spark will use the Parquet format.
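As a concrete illustration, here is a sketch of reading a CSV file, assuming an active SparkSession bound to the variable spark; the file path is a placeholder, and the two options shown ask Spark to treat the first row as a header and to infer column types:

```
spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/some/file.csv")
```

The same fluent pattern applies to the other core formats; only the format name and the available options change.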