Consuming data streams
Similar to a batch processing job, we create a new Spark application using a SparkConf object and a context. In a streaming application, however, the context is created with a batch size parameter that applies to every incoming stream: both the GDELT and Twitter layers, being part of the same context, are tied to the same batch size. Since GDELT data is published every 15 minutes, our batch size will naturally be 15 minutes, as we want to predict categories on a pseudo real-time basis:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val sparkConf = new SparkConf().setAppName("GZET")
val ssc = new StreamingContext(sparkConf, Minutes(15))
val sc = ssc.sparkContext
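Any stream attached to this context inherits the same 15-minute batch interval. As a minimal sketch of the Twitter side, assuming the external spark-streaming-twitter artifact is on the classpath and twitter4j.oauth.* credentials are set as system properties (neither is shown in this excerpt):

import org.apache.spark.streaming.twitter.TwitterUtils

// Emits one batch of statuses every 15 minutes, as configured on ssc above
val twitterStream = TwitterUtils.createStream(ssc, None)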
Creating a GDELT data stream
There are many ways of publishing external data into a Spark Streaming application. One could ...
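One possibility, sketched below under stated assumptions rather than as the approach taken here, is a custom Receiver that polls GDELT's 15-minute "lastupdate" pointer and pushes each newly published record into the stream. The GdeltReceiver class name and its one-minute polling interval are illustrative; the URL is GDELT v2's published update pointer:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

import scala.io.Source

// Hypothetical receiver: polls GDELT's lastupdate file and stores the
// pointer to each newly published 15-minute batch into the stream
class GdeltReceiver(lastUpdateUrl: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  private var lastSeen: Option[String] = None

  override def onStart(): Unit = {
    // onStart must not block: poll from a dedicated thread
    new Thread("GDELT Receiver") {
      override def run(): Unit = poll()
    }.start()
  }

  override def onStop(): Unit = {
    // The polling loop checks isStopped(); nothing else to clean up
  }

  private def poll(): Unit = {
    while (!isStopped()) {
      try {
        val src = Source.fromURL(lastUpdateUrl)
        val latest = try src.getLines().mkString("\n") finally src.close()
        if (!lastSeen.contains(latest)) {
          store(latest)          // hand the new batch pointer to Spark
          lastSeen = Some(latest)
        }
      } catch {
        case e: Exception => reportError("Failed to poll GDELT", e)
      }
      Thread.sleep(60 * 1000)    // check once a minute
    }
  }
}

// Attach the receiver to the streaming context defined above
val gdeltStream = ssc.receiverStream(
  new GdeltReceiver("http://data.gdeltproject.org/gdeltv2/lastupdate.txt"))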