There are two approaches to using Spark Streaming with Kafka: the receiver-based approach and the direct approach. The receiver-based approach works like streaming from other sources such as text files and sockets: data received from Kafka is stored in Spark executors and processed by jobs launched by the Spark Streaming context. This is not the best approach, because it can lose data in the event of a failure. The direct approach (introduced in Spark 1.3) is therefore preferable. Instead of using receivers to receive data, it periodically queries Kafka for the latest offsets in each topic and partition and, from those, defines the offset ranges to process in each batch. When the jobs to process the data are executed, Kafka's simple ...
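As a minimal sketch of the direct approach, the snippet below uses `KafkaUtils.createDirectStream` from the `spark-streaming-kafka-0-10` integration module; the topic name, broker address, and group ID are placeholder assumptions, and the surrounding application setup (packaging, cluster configuration) is omitted:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectKafkaExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaExample")
    // Batches are defined every 5 seconds; for each batch the direct
    // stream computes an offset range per Kafka partition to consume.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder connection settings -- adjust for your environment.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-streaming-example",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("example-topic") // hypothetical topic name

    // No receiver is created: each batch reads its offset range
    // directly from the Kafka partitions.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    stream.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because each batch maps one-to-one onto Kafka offset ranges, this approach gives exactly-once semantics for the read from Kafka as long as offsets are tracked and committed by the application itself.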
Spark Streaming and Kafka