Consuming data streams

As with a batch processing job, we create a new Spark application using a SparkConf object and a context. In a streaming application, the context is created with a batch size parameter (the batch interval) that applies to every incoming stream: the GDELT and Twitter layers are part of the same context, so both are tied to the same batch size. Because GDELT data is published every 15 minutes, 15 minutes is the natural batch size for predicting categories on a pseudo-real-time basis:

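import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
// One 15-minute batch interval, shared by every stream attached to this context
val sparkConf = new SparkConf().setAppName("GZET")
val ssc = new StreamingContext(sparkConf, Minutes(15))
val sc = ssc.sparkContext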
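
To make the shared batch size concrete, here is a minimal sketch, not the book's implementation, in which two streams attached to one context are processed on the same schedule. The in-memory queues stand in for the GDELT and Twitter sources, and the object name, the `local[2]` master, and the 1-second interval are assumptions chosen only so the example finishes quickly on a laptop; the application described above would use Minutes(15).

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SharedBatchIntervalSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and a 1-second interval, for a quick local run only.
    val conf = new SparkConf()
      .setAppName("SharedBatchIntervalSketch")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Hypothetical stand-ins for the two sources: each queue holds one RDD,
    // which queueStream hands out as the content of one batch.
    val gdeltQueue = mutable.Queue[RDD[String]](sc.makeRDD(Seq("gdelt-1", "gdelt-2")))
    val twitterQueue = mutable.Queue[RDD[String]](sc.makeRDD(Seq("tweet-1")))

    // Both streams are created from the same context, so both are cut
    // into batches using the same interval.
    val gdeltStream = ssc.queueStream(gdeltQueue)
    val twitterStream = ssc.queueStream(twitterQueue)

    // Print the number of records each stream received in every batch.
    gdeltStream.count().print()
    twitterStream.count().print()

    ssc.start()
    // Let a few batches run, then shut down the context and the SparkContext.
    ssc.awaitTerminationOrTimeout(5000L)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Neither stream is given an interval of its own: the interval is an argument of the StreamingContext constructor only, which is why the slower source (GDELT, every 15 minutes) ends up dictating the batch size for the whole application.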

Creating a GDELT data stream

There are many ways of publishing external data into a Spark streaming application. One could ...

Mastering Spark for Data Science by Andrew Morgan, Antoine Amend, David George, Matthew Hallett
