January 2019
Intermediate to advanced
322 pages
7h 29m
English
To use Spark Streaming with Kafka, you have two options: use a receiver or go direct. The receiver-based approach is similar to streaming from other sources such as text files and sockets – data received from Kafka is stored in Spark executors and processed by jobs launched by a Spark Streaming context. This is not the best approach, because it can lose data in the event of failures, which makes the direct approach (introduced in Spark 1.3) the better choice. Instead of using receivers to receive data, it periodically queries Kafka for the latest offsets in each topic and partition and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are executed, Kafka's simple ...
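The offset-range bookkeeping described above can be sketched in a few lines. This is a minimal, self-contained illustration of the idea, not Spark's actual implementation: the `OffsetRange` class and `plan_batch` function below are hypothetical stand-ins that mimic how the direct approach asks Kafka for the latest offset per topic partition and then derives one range per partition for the next micro-batch (real code would use `KafkaUtils.createDirectStream` from the spark-streaming-kafka package).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OffsetRange:
    # Loosely modeled on Spark's OffsetRange: one range per
    # topic partition per micro-batch.
    topic: str
    partition: int
    from_offset: int
    until_offset: int

def plan_batch(latest_offsets, committed):
    """Define the offset range to process for each topic partition:
    from the last processed offset up to the latest offset reported
    by Kafka. Returns the ranges plus the updated bookkeeping."""
    ranges = []
    for (topic, partition), latest in latest_offsets.items():
        start = committed.get((topic, partition), 0)
        if latest > start:
            ranges.append(OffsetRange(topic, partition, start, latest))
    # Remember how far we have planned, for the next batch.
    new_committed = dict(latest_offsets)
    return ranges, new_committed

# Two micro-batches against a toy topic with two partitions.
committed = {}
batch1, committed = plan_batch({("events", 0): 5, ("events", 1): 3}, committed)
batch2, committed = plan_batch({("events", 0): 9, ("events", 1): 3}, committed)
```

After the first batch, both partitions yield a range starting at offset 0; in the second batch only partition 0 has advanced, so a single range covering offsets 5 to 9 is produced – no receiver or write-ahead log is involved, which is what makes the direct approach's failure semantics simpler.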