Chapter 19. Spark Streaming Sources

As you learned in Chapter 2, a streaming source is a data provider that continuously delivers data. In Spark Streaming, sources are adapters that run within the context of the Spark Streaming job, implement the interaction with the external data provider, and hand the data to Spark Streaming through the DStream abstraction. From the programming perspective, consuming a streaming data source means creating a DStream using the implementation appropriate for that source.

In “The DStream Abstraction”, we saw an example of how to consume data from a network socket. Let’s revisit that example in Example 19-1.

Example 19-1. Creating a text stream from a socket connection
// creates a DStream using a client socket connected to the given host and port
val textDStream: DStream[String] = ssc.socketTextStream("localhost", 9876)
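Example 19-1 assumes that ssc, the streaming context, already exists. A minimal sketch of that setup might look as follows; the application name, the local[2] master, and the 2-second batch interval are illustrative choices rather than requirements.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// configure the Spark application; at least two local threads are needed so that
// one can run the socket receiver while another processes the received batches
val conf = new SparkConf().setAppName("SocketSourceExample").setMaster("local[2]")

// the streaming context drives the streaming job and defines the batch interval
val ssc = new StreamingContext(conf, Seconds(2))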

In Example 19-1, we can see that the creation of a streaming source is provided by a dedicated implementation. In this case it is the socketTextStream method of the ssc instance, the streaming context, and it results in a DStream[String] whose type parameter reflects the text content delivered by the socket. Although the implementation for each source is different, this pattern is the same for all of them: creating a source requires a StreamingContext and results in a DStream that represents the contents of the stream. The streaming application further operates on the resulting DStream to implement ...
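To illustrate that pattern, the following sketch creates a DStream from another built-in source, the file source, and then operates on the resulting DStream; the monitored directory and the per-batch word count are illustrative assumptions, not part of Example 19-1.

import org.apache.spark.streaming.dstream.DStream

// same pattern, different source: the streaming context creates the DStream;
// textFileStream monitors a directory and delivers each new text file line by line
val fileLines: DStream[String] = ssc.textFileStream("/tmp/streaming-input")

// the application then operates on the resulting DStream;
// here, a word count computed independently for each batch
val wordCounts: DStream[(String, Int)] = fileLines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.print()      // output a sample of each batch's counts
ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // run until the job is stopped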
