Chapter 14: Data Processing with Apache Spark

In the previous chapter, you learned how to add streaming data to your data pipelines. Using Python or Apache NiFi, you can extract, transform, and load streaming data. However, to perform transformations on large amounts of streaming data, data engineers turn to tools such as Apache Spark. On non-trivial transformations, Apache Spark is faster than most other methods, such as MapReduce, and it allows distributed data processing.

In this chapter, we're going to cover the following main topics:

  • Installing and running Spark
  • Installing and configuring PySpark
  • Processing data with PySpark

Installing and running Spark

Apache Spark is a distributed data processing engine that can handle both streams ...
