Chapter 7. Interacting with External Data Sources

To run any algorithm in Spark, you need to read input data from a data source, apply your algorithm in the form of a set of PySpark transformations and actions (expressed as a DAG), and finally write the desired output to a target data source. To write algorithms that perform well, it's therefore important to understand how to read from and write to external data sources.
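As a minimal sketch of this read-transform-write pattern in PySpark (the input and output paths here are hypothetical placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

# Create (or reuse) a SparkSession, the entry point to Spark
spark = SparkSession.builder.appName("read-transform-write").getOrCreate()

# 1. Read input data from a data source
df = spark.read.text("/tmp/input/records.txt")

# 2. Apply transformations; Spark records them lazily as a DAG
words = (df
         .select(explode(split(col("value"), r"\s+")).alias("word"))
         .groupBy("word")
         .count())

# 3. Write the desired output to a target data source
words.write.mode("overwrite").csv("/tmp/output/word_counts")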

In the previous chapters, we explored working with Spark's built-in data abstractions (RDDs and DataFrames). In this chapter, we will focus on how Spark interfaces with external data sources.

As Figure 7-1 shows, Spark can read data from a wide range of external storage systems, such as the Linux filesystem, Amazon S3, HDFS, Hive tables, and relational databases (Oracle, MySQL, PostgreSQL, and so on), through its data source interface. This chapter will show you how to read data in and convert it into RDDs or DataFrames for further processing. I'll also show you how to write Spark data back to external storage systems such as files, Amazon S3, and JDBC-compliant databases.

Figure 7-1. Spark external data sources
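To make this concrete, here is a hedged sketch of reading from and writing to several of these storage systems with the DataFrame reader/writer API. All paths, bucket names, and table names below are hypothetical, and S3, HDFS, and Hive access each require the appropriate connector or configuration on the cluster:

# Local (Linux) filesystem
local_df = spark.read.json("file:///tmp/data/events.json")

# Amazon S3 (assumes the Hadoop S3A connector is configured)
s3_df = spark.read.parquet("s3a://my-bucket/data/events/")

# HDFS
hdfs_df = spark.read.csv("hdfs://namenode:8020/data/events.csv", header=True)

# Hive table (assumes a SparkSession built with .enableHiveSupport())
hive_df = spark.table("events_db.events")

# Writing back out, for example to S3 as Parquet
s3_df.write.mode("append").parquet("s3a://my-bucket/data/processed/")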

Relational Databases

Let's start with relational databases. A relational database is a collection of data items organized as a set of tables with predefined relationships between them, where each table consists of rows (records) and columns (attributes).
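Before digging into the details, here is a sketch of how a table in such a database can be read into (and written from) a DataFrame over JDBC. The URL, table names, and credentials below are placeholders, and the matching JDBC driver JAR must be available on Spark's classpath:

# Read a table from a relational database over JDBC
customers = (spark.read
             .format("jdbc")
             .option("url", "jdbc:postgresql://dbhost:5432/sales_db")
             .option("dbtable", "public.customers")
             .option("user", "spark_user")
             .option("password", "secret")
             .load())

customers.printSchema()

# Write a DataFrame back to a (new) JDBC table
(customers.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://dbhost:5432/sales_db")
 .option("dbtable", "public.customers_copy")
 .option("user", "spark_user")
 .option("password", "secret")
 .mode("overwrite")
 .save())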
