Start with batch processing. Always start with batch! To build our first batch workload, we will use a combination of the following services:
- S3: To process data in batch mode, we need our raw data stored in a place where the services we are going to use can easily access it. For example, the transformation tool, Glue, must be able to pick up the raw files and write the transformed output back once transformations are done.
- Glue: This is the batch transformation tool we will use to create the transformation scripts, schedule the job, and catalog the dataset we are going to create. We also need to create crawlers that will scan the files in S3 and identify the schema—columns and data types—the files contain. ...
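
To make the crawler step more concrete, here is a minimal sketch of how a Glue crawler pointed at an S3 prefix might be defined with boto3. It assumes boto3 is installed and AWS credentials are configured. The crawler name, IAM role ARN, database name, bucket, and prefix are hypothetical placeholders, and the helper functions `build_crawler_request` and `create_and_start_crawler` are illustrative names, not part of any AWS library.

```python
def build_crawler_request(name, role_arn, database, bucket, prefix=""):
    """Build the keyword arguments for Glue's create_crawler call.

    All values passed in are placeholders chosen by the caller; nothing
    here contacts AWS.
    """
    clean_prefix = prefix.strip("/")
    # Crawl the whole bucket when no prefix is given.
    path = f"s3://{bucket}/{clean_prefix}/" if clean_prefix else f"s3://{bucket}/"
    return {
        "Name": name,
        "Role": role_arn,          # IAM role Glue assumes to read the files
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": path}]},
    }


def create_and_start_crawler(request, region_name=None):
    """Create the crawler and trigger one run (requires AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values for illustration only.
request = build_crawler_request(
    name="raw-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="raw_db",
    bucket="example-raw-bucket",
    prefix="/sales/",
)
```

Calling `create_and_start_crawler(request)` would then register the crawler and start a run. Once the run finishes, the inferred columns and data types should appear as a table in the named Data Catalog database.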