March 2022
Beginner to intermediate
430 pages
9h 24m
English
In the previous few chapters, we explained how you can use an EMR cluster for on-demand ETL jobs, or as a long-running cluster that either runs a real-time streaming application or serves as a backend for interactive development with notebooks. But when we build a data pipeline to automate data ingestion, cleansing, or transformation, we need an orchestration tool with which we can build workflows that are triggered either on a schedule or by an event.
Two orchestration tools, AWS Step Functions and Apache Airflow, are very popular for building data pipelines with Amazon EMR. AWS also provides a managed offering ...