Interactively loading data from S3
Now let's try another exercise with the Spark shell. As part of Amazon's EMR Spark support, they have handily provided some sample data of Wikipedia traffic statistics in S3, in the format that Spark can use. To access the data, you first need to set your AWS access credentials as shell params. For instructions on signing up for EC2 and setting up the shell parameters, see the Running Spark on EC2 with the scripts section in Chapter 1, Installing Spark and Setting Up Your Cluster (S3 access requires additional keys such as fs.s3n.awsAccessKeyId/awsSecretAccessKey
or the use of the s3n://user:pw@
syntax). You can also set the shell parameters as AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
. We will leave the AWS ...
Get Fast Data Processing with Spark 2 - Third Edition now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.