The following data preprocessing logic is executed on a Spark cluster. Let's go through the steps:
- We will begin by gathering the arguments sent by the SageMaker notebook instance, as follows:
import os
import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['S3_INPUT_BUCKET', 'S3_INPUT_KEY_PREFIX', 'S3_INPUT_FILENAME', 'S3_OUTPUT_BUCKET', 'S3_OUTPUT_KEY_PREFIX', 'S3_MODEL_BUCKET', 'S3_MODEL_KEY_PREFIX'])
The getResolvedOptions() utility function from the AWS Glue library (awsglue.utils) parses these named arguments out of sys.argv, so every value sent by the SageMaker notebook instance becomes available in the args dictionary.
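For context, here is a minimal sketch of the notebook side of this handoff; the job name and parameter values are hypothetical, but Glue does require each argument key to carry a -- prefix:

import boto3

glue = boto3.client('glue')

# Hypothetical invocation from the SageMaker notebook instance;
# Glue forwards these Arguments to the job script via sys.argv.
glue.start_job_run(
    JobName='news-preprocessing-job',            # hypothetical job name
    Arguments={
        '--S3_INPUT_BUCKET': 'my-input-bucket',
        '--S3_INPUT_KEY_PREFIX': 'raw',
        '--S3_INPUT_FILENAME': 'abcnews.csv',
        '--S3_OUTPUT_BUCKET': 'my-output-bucket',
        '--S3_OUTPUT_KEY_PREFIX': 'processed',
        '--S3_MODEL_BUCKET': 'my-model-bucket',
        '--S3_MODEL_KEY_PREFIX': 'model'
    }
)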
- Next, we will read the news headlines from Amazon S3 into a Spark DataFrame, as follows:
abcnewsdf = spark.read.option("header", "true").csv('s3://' + os.path.join(args['S3_INPUT_BUCKET'], args['S3_INPUT_KEY_PREFIX'], args['S3_INPUT_FILENAME']))
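Although it is not part of the original script, a quick sanity check on the loaded DataFrame can catch path or header problems early; this assumes the abcnewsdf DataFrame created above:

abcnewsdf.printSchema()                   # expect a single string column of headline text
abcnewsdf.show(5, truncate=False)         # preview the first few headlines
print('Rows loaded:', abcnewsdf.count())  # confirm the file was not empty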