Chapter 14: Data Processing with Apache Spark
In the previous chapter, you learned how to add streaming data to your data pipelines. Using Python or Apache NiFi, you can extract, transform, and load streaming data. However, to perform transformations on large amounts of streaming data, data engineers turn to tools such as Apache Spark. On non-trivial transformations, Apache Spark is faster than most alternatives, such as MapReduce, and it supports distributed data processing.
In this chapter, we're going to cover the following main topics:
- Installing and running Spark
- Installing and configuring PySpark
- Processing data with PySpark
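Before working through those topics, here is a rough sketch of what a local Spark install can look like. The release version, archive URL, and install path below are assumptions for illustration; check the Apache Spark downloads page for current values.

```shell
# Hypothetical install sketch -- version, mirror URL, and paths are assumptions.
# Download a prebuilt Spark release from the Apache archive.
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# Unpack it and point SPARK_HOME at the result.
tar -xzf spark-3.5.1-bin-hadoop3.tgz
export SPARK_HOME="$PWD/spark-3.5.1-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"

# Start an interactive shell on a local master to verify the install.
spark-shell --master "local[*]"
```

Spark also requires a Java runtime on the host; the `local[*]` master string runs Spark on all available cores of the local machine, which is enough for the examples in this chapter.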
Installing and running Spark
Apache Spark is a distributed data processing engine that can handle both streams ...