Data Algorithms with Spark

by Mahmoud Parsian
April 2022
Content level: Intermediate to advanced
435 pages
9h 44m
English
O'Reilly Media, Inc.

Chapter 7. Interacting with External Data Sources

To run any algorithm in Spark, you read input data from a data source, apply your algorithm as a set of PySpark transformations and actions (expressed as a DAG), and finally write the output to a target data source. So, to write algorithms that perform well, it’s important to understand how to read from and write to external data sources.
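The read, transform, write pattern can be sketched as follows. This is a minimal illustration rather than code from the book: the word-count transformation, the function names `run_pipeline` and `main`, and the paths `input.txt` and `word_counts` are all assumptions chosen for the example.

```python
def run_pipeline(spark, input_path, output_path):
    """Read lines of text, count words, and write the counts as CSV."""
    # 1. Read: each input line becomes a row with a single "value" column.
    lines = spark.read.text(input_path)

    # 2. Transform: split lines into words and count occurrences.
    #    These steps are lazy; Spark only records them in the DAG.
    counts = (
        lines.selectExpr("explode(split(value, ' ')) AS word")
        .groupBy("word")
        .count()
    )

    # 3. Write: saving the result is the action that triggers execution.
    counts.write.mode("overwrite").csv(output_path, header=True)


def main():
    # Imported here so run_pipeline depends only on the session it is given.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-transform-write").getOrCreate()
    try:
        # Hypothetical input and output locations.
        run_pipeline(spark, "input.txt", "word_counts")
    finally:
        spark.stop()
```

Calling `main()` on a machine with PySpark installed runs the whole pipeline. Nothing is computed until the final write, which is why the cost of reading and writing tends to show up in the same job as the algorithm itself.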

In the previous chapters, we have explored interacting with the built-in data sources (RDDs and DataFrames) in Spark. In this chapter, we will focus on how Spark interfaces with external data sources.

As Figure 7-1 shows, Spark can read data from a huge range of external storage systems like the Linux filesystem, Amazon S3, HDFS, Hive tables, and relational databases (such as Oracle, MySQL, or PostgreSQL) through its data source interface. This chapter will show you how to read data in and then convert it into RDDs or DataFrames for further processing. I’ll also show you how Spark’s data can be written back to external storage systems like files, Amazon S3, and JDBC-compliant databases.

Figure 7-1. Spark external data sources
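As a rough sketch of reading from and writing to a JDBC-compliant database, the helpers below wrap Spark's `jdbc` data source. The function names (`jdbc_options`, `read_table`, `write_table`) and every connection detail (URL, table, user, password) are hypothetical and chosen only for illustration; the matching JDBC driver JAR is also assumed to be on Spark's classpath.

```python
def jdbc_options(url, table, user, password, driver):
    """Collect the connection settings Spark's JDBC data source expects."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


def read_table(spark, options):
    # Load a database table as a DataFrame through the JDBC data source.
    return spark.read.format("jdbc").options(**options).load()


def write_table(df, options, mode="append"):
    # Write a DataFrame back to a database table.
    df.write.format("jdbc").options(**options).mode(mode).save()


# Hypothetical connection details for a local MySQL database.
example_options = jdbc_options(
    url="jdbc:mysql://localhost:3306/demo_db",
    table="employees",
    user="demo_user",
    password="demo_password",
    driver="com.mysql.cj.jdbc.Driver",
)
```

Given an active `SparkSession` named `spark`, `read_table(spark, example_options)` would return a DataFrame, and its `.rdd` attribute exposes the same rows as an RDD for further processing.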

Relational Databases

Let’s start with relational databases. A relational database is a collection of data items organized as a set ...


Publisher Resources

ISBN: 9781492082378