Preprocessing data using PySpark

The following data preprocessing logic is executed on a Spark cluster. Let's go through the steps:

  1. We will begin by gathering arguments sent by the SageMaker Notebook instance, as follows:
args = getResolvedOptions(sys.argv, ['S3_INPUT_BUCKET', 'S3_INPUT_KEY_PREFIX', 'S3_INPUT_FILENAME', 'S3_OUTPUT_BUCKET', 'S3_OUTPUT_KEY_PREFIX', 'S3_MODEL_BUCKET', 'S3_MODEL_KEY_PREFIX'])

We use the getResolvedOptions() utility function from the AWS Glue library to read all of the arguments passed in by the SageMaker notebook instance.
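To see what getResolvedOptions() gives us without running on a Glue cluster, here is a minimal, simplified stand-in that mimics its behavior of pulling `--NAME value` pairs out of sys.argv; the function name and the argument values below are hypothetical placeholders, not part of the book's code:

```python
def resolve_options(argv, option_names):
    """Simplified stand-in for awsglue.utils.getResolvedOptions():
    pull '--NAME value' pairs out of an argv-style list for the
    requested option names and return them as a dict."""
    resolved = {}
    for name in option_names:
        flag = '--' + name
        if flag in argv:
            # The value is the token immediately after the flag
            resolved[name] = argv[argv.index(flag) + 1]
    return resolved

# Example invocation, mimicking arguments a notebook might pass
# (bucket and prefix values are hypothetical placeholders):
argv = ['script.py',
        '--S3_INPUT_BUCKET', 'my-bucket',
        '--S3_INPUT_KEY_PREFIX', 'raw/news']
args = resolve_options(argv, ['S3_INPUT_BUCKET', 'S3_INPUT_KEY_PREFIX'])
print(args['S3_INPUT_BUCKET'])  # my-bucket
```

The real getResolvedOptions() also validates that required arguments are present; this sketch only illustrates the resulting dict shape used in the steps that follow.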

  2. Next, we will read the news headlines, as follows:
abcnewsdf = spark.read.option("header","true").csv(('s3://' + os.path.join(args['S3_INPUT_BUCKET'], args['S3_INPUT_KEY_PREFIX'], ...
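The snippet above builds the S3 URI by concatenating 's3://' with os.path.join() over the bucket, key prefix, and filename arguments before passing it to spark.read. The path construction itself can be sketched in plain Python; the bucket, prefix, and filename values here are hypothetical placeholders:

```python
import os

# Simulated arguments, shaped like the dict returned by getResolvedOptions()
# (all values are hypothetical placeholders):
args = {
    'S3_INPUT_BUCKET': 'my-input-bucket',
    'S3_INPUT_KEY_PREFIX': 'raw/news',
    'S3_INPUT_FILENAME': 'abcnews.csv',
}

# Same pattern as the snippet above: 's3://' + os.path.join(bucket, prefix, file)
input_path = 's3://' + os.path.join(
    args['S3_INPUT_BUCKET'],
    args['S3_INPUT_KEY_PREFIX'],
    args['S3_INPUT_FILENAME'],
)
print(input_path)  # s3://my-input-bucket/raw/news/abcnews.csv
```

The resulting URI is what spark.read.option("header", "true").csv(...) consumes; the "header" option tells Spark to treat the first CSV row as column names rather than data.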
