Preprocessing data using PySpark

The following data preprocessing logic is executed on a Spark cluster. Let's go through the steps:

  1. We will begin by gathering the arguments sent by the SageMaker notebook instance, as follows:
args = getResolvedOptions(sys.argv,
                          ['S3_INPUT_BUCKET', 'S3_INPUT_KEY_PREFIX',
                           'S3_INPUT_FILENAME', 'S3_OUTPUT_BUCKET',
                           'S3_OUTPUT_KEY_PREFIX', 'S3_MODEL_BUCKET',
                           'S3_MODEL_KEY_PREFIX'])

We will use the getResolvedOptions() utility function from the AWS Glue library to read all the arguments that were sent by the SageMaker notebook instance.
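On a Glue job, these values arrive on the script's command line as `--KEY value` pairs, and getResolvedOptions() returns them as a plain dictionary keyed by argument name. As a rough sketch of that behavior (the real function lives in awsglue.utils; the argument values below are hypothetical stand-ins):

```python
import sys

def resolve_options(argv, option_names):
    """Roughly mimic awsglue.utils.getResolvedOptions:
    pick each --KEY value pair out of argv and return a dict."""
    resolved = {}
    for name in option_names:
        flag = '--' + name
        if flag in argv:
            resolved[name] = argv[argv.index(flag) + 1]
    return resolved

# Hypothetical argv, shaped the way Glue passes arguments to the script
argv = ['job.py', '--S3_INPUT_BUCKET', 'my-input-bucket',
        '--S3_INPUT_KEY_PREFIX', 'raw/headlines']
args = resolve_options(argv, ['S3_INPUT_BUCKET', 'S3_INPUT_KEY_PREFIX'])
```

In the actual script, sys.argv is passed directly, so the same call works unchanged whether the job is launched from the notebook or from the Glue console.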

  2. Next, we will read the news headlines, as follows:
abcnewsdf = spark.read.option("header","true").csv(('s3://' + os.path.join(args['S3_INPUT_BUCKET'], args['S3_INPUT_KEY_PREFIX'], ...
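The S3 URI is assembled by joining the bucket, key prefix, and filename arguments with os.path.join and prepending the s3:// scheme. A minimal sketch of that construction, using hypothetical bucket and key values in place of the ones Glue resolves at runtime:

```python
import os

# Hypothetical values standing in for the resolved job arguments
args = {'S3_INPUT_BUCKET': 'my-input-bucket',
        'S3_INPUT_KEY_PREFIX': 'raw/headlines',
        'S3_INPUT_FILENAME': 'abcnews.csv'}

# Join the pieces into a full S3 URI for the input CSV
input_uri = 's3://' + os.path.join(args['S3_INPUT_BUCKET'],
                                   args['S3_INPUT_KEY_PREFIX'],
                                   args['S3_INPUT_FILENAME'])

# On the cluster, this URI is then read into a DataFrame
# (spark is the SparkSession that Glue provides to the script):
# abcnewsdf = spark.read.option("header", "true").csv(input_uri)
```

The option("header", "true") call tells Spark to treat the first CSV row as column names rather than data.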

Get Hands-On Artificial Intelligence on Amazon Web Services now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.