Fast Data Processing with Spark 2 - Third Edition by Krishna Sankar

Case study - AlphaGo tweets analytics

Now that we have a good understanding of GraphX, let's apply it to analyzing a retweet network. As in any big data project, the first task is to define a pipeline and work out the data elements, their sources, and the transformations, mapping, and processing they need.
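To make the idea of mapping data elements onto a graph concrete, here is a minimal sketch in Python. It is not the author's pipeline. The record layout (the `user` and `retweet_of` fields), the sample data, and the helper `build_retweet_graph` are all hypothetical, chosen only to show how tweets might be turned into the numbered vertices and weighted edges that a graph library expects.

```python
from collections import Counter

# Hypothetical, simplified tweet records. Real Twitter payloads carry many
# more fields and use different names.
tweets = [
    {"user": "alice", "retweet_of": None},
    {"user": "bob", "retweet_of": "alice"},
    {"user": "carol", "retweet_of": "alice"},
    {"user": "bob", "retweet_of": "carol"},
    {"user": "bob", "retweet_of": "alice"},
]


def build_retweet_graph(records):
    """Map tweet records to (vertices, edges) for a retweet network.

    vertices: dict of user name -> integer id
    edges: sorted list of (src_id, dst_id, count), meaning the source user
           retweeted the destination user `count` times
    """
    # Collect every user that appears as a tweeter or as a retweeted author.
    users = sorted(
        {r["user"] for r in records}
        | {r["retweet_of"] for r in records if r["retweet_of"]}
    )
    vertices = {name: idx for idx, name in enumerate(users, start=1)}

    # Count retweets per (retweeter, original author) pair.
    counts = Counter(
        (r["user"], r["retweet_of"]) for r in records if r["retweet_of"]
    )
    edges = sorted((vertices[s], vertices[d], n) for (s, d), n in counts.items())
    return vertices, edges


vertices, edges = build_retweet_graph(tweets)
print(vertices)
print(edges)
```

The integer ids are assigned here because graph libraries such as GraphX identify vertices by numeric id; the edge weight is one possible choice of edge attribute, not the only one.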

Data pipeline

For this case study, I collected Twitter data pertaining to the AlphaGo project:

[Figure: Data pipeline]

While the full mechanics of collecting data from Twitter are out of scope, I will briefly mention the main steps:

  1. Using Python and the tweepy framework, you can download the tweets mentioning the hashtag #alphago. Initially, pull all the tweets that Twitter ...