October 2020
Beginner to intermediate
356 pages
6h 50m
English
In the previous chapter, you learned how to add streaming data to your data pipelines. Using Python or Apache NiFi, you can extract, transform, and load streaming data. However, to perform transformations on large amounts of streaming data, data engineers turn to tools such as Apache Spark. On non-trivial transformations, Apache Spark is faster than most other methods, such as MapReduce, and it allows distributed data processing.
In this chapter, we're going to cover the following main topics:
Apache Spark is a distributed data processing engine that can handle both streams ...