Now let's try another exercise with the Spark shell. As part of its EMR Spark support, Amazon has provided sample Wikipedia traffic statistics data in S3, in a format that Spark can use. To access the data, you first need to set your AWS access credentials as shell parameters. For instructions on signing up for EC2 and setting up the shell parameters, see the Running Spark on EC2 with the scripts section in Chapter 1, Installing Spark and Setting Up Your Cluster (S3 access requires additional keys such as
fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey, or the use of the
s3n://user:pw@ syntax). You can also set the shell parameters
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. We will leave the AWS ...