Preprocessing data using PySpark

The following data preprocessing logic is executed on a Spark cluster. Let's go through the steps:

  1. We begin by gathering the arguments sent by the SageMaker notebook instance.

The getResolvedOptions() utility function from the AWS Glue library reads all the arguments that the SageMaker notebook instance sent.
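As a minimal sketch of this step, the snippet below calls getResolvedOptions() with an argv-style list and the two argument names that appear later in this section, S3_INPUT_BUCKET and S3_INPUT_KEY_PREFIX. The read_job_args helper, the sample values, and the argparse fallback used when the awsglue package is not installed are assumptions for illustration, not part of the original script.

```python
import argparse

try:
    # Present in environments that ship the AWS Glue Python library.
    from awsglue.utils import getResolvedOptions
except ImportError:
    # Assumption: a local stand-in so the sketch runs without awsglue.
    def getResolvedOptions(argv, options):
        parser = argparse.ArgumentParser()
        for name in options:
            parser.add_argument('--' + name, required=True)
        parsed, _unknown = parser.parse_known_args(argv[1:])
        return vars(parsed)


def read_job_args(argv):
    """Return a dict with the S3 input location passed as --NAME value pairs."""
    return getResolvedOptions(argv, ['S3_INPUT_BUCKET', 'S3_INPUT_KEY_PREFIX'])


# Hypothetical command line; a real job would pass sys.argv instead.
example_argv = ['preprocess.py',
                '--S3_INPUT_BUCKET', 'example-bucket',
                '--S3_INPUT_KEY_PREFIX', 'example/prefix']
args = read_job_args(example_argv)
print(args['S3_INPUT_BUCKET'], args['S3_INPUT_KEY_PREFIX'])
```

The returned value behaves like a dictionary keyed by argument name, which is why later steps can index it as args['S3_INPUT_BUCKET'].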

  2. Next, we will read the news headlines, as follows:
abcnewsdf = spark.read.option("header","true").csv(('s3://' + os.path.join(args['S3_INPUT_BUCKET'], args['S3_INPUT_KEY_PREFIX'], ...
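The read in this step does two things: it combines a bucket, a key prefix, and a file name into an s3:// URI, and it asks Spark to treat the first CSV row as column names. The sketch below isolates those two pieces. The names s3_uri and read_headlines and the file name headlines.csv are hypothetical, chosen only for illustration (the actual file name is elided above), and spark is assumed to be an existing SparkSession.

```python
import posixpath


def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI with forward slashes."""
    # posixpath.join always uses '/', unlike os.path.join on Windows.
    # Strip surrounding slashes so an absolute-looking part cannot reset the join.
    cleaned = [p.strip('/') for p in parts if p.strip('/')]
    return 's3://' + posixpath.join(bucket.strip('/'), *cleaned)


def read_headlines(spark, uri):
    """Read a CSV whose first row holds column names; spark is a SparkSession."""
    return spark.read.option('header', 'true').csv(uri)


# Hypothetical values for illustration only.
print(s3_uri('example-bucket', 'example/prefix/', 'headlines.csv'))
# prints: s3://example-bucket/example/prefix/headlines.csv
```

On a Linux host, os.path.join produces the same forward-slash result, so the original expression behaves the same way there; posixpath simply makes the separator explicit.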

