Chapter 14: Data Processing with Apache Spark

In the previous chapter, you learned how to add streaming data to your data pipelines. Using Python or Apache NiFi, you can extract, transform, and load streaming data. However, to perform transformations on large amounts of streaming data, data engineers turn to tools such as Apache Spark. For non-trivial transformations, Apache Spark is faster than most alternatives, such as MapReduce, and it supports distributed data processing.

In this chapter, we're going to cover the following main topics:

  • Installing and running Spark
  • Installing and configuring PySpark
  • Processing data with PySpark

Installing and running Spark

Apache Spark is a distributed data processing engine that can handle both streams ...
