4

Batch and Stream Data Processing Using PySpark

When setting up your architecture, you decided whether to support batch processing, streaming, or both. This chapter covers batch and stream processing with Apache Spark using Python. Spark can be your go-to tool for moving and processing data at scale. We will also look closely at DataFrames and how to use them in both types of data processing.

In this chapter, we’re going to cover the following main topics:

  • Batch processing
  • Working with schemas
  • User-defined functions
  • Stream processing

Technical requirements

The tooling used in this chapter is tied to the tech stack chosen for the book. All vendors should offer a free trial account.

I will be using ...
