There are several general techniques for stream processing. Two of the most common are:
- Treat each record individually and process it as soon as it is seen.
- Combine multiple records into mini-batches. These mini-batches can be delineated either by time or by the number of records in a batch.
Spark Streaming takes the second approach. The core primitive in Spark Streaming is the discretized stream, or DStream. A DStream is a sequence of mini-batches, where each mini-batch is represented as a Spark RDD.
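The mini-batch idea can be illustrated with a small sketch in plain Python. This is not Spark code and does not use the Spark Streaming API; the record shape `(timestamp, payload)` and the helper names `batches_by_count` and `batches_by_time` are assumptions made only for this illustration. It shows the two ways of delineating mini-batches mentioned above: by number of records and by time.

```python
from itertools import groupby, islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: (arrival time in seconds, payload).
Record = Tuple[float, str]


def batches_by_count(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield mini-batches holding at most `size` records each."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batches_by_time(records: Iterable[Record], interval: float) -> Iterator[List[Record]]:
    """Yield mini-batches of records whose timestamps fall in the same interval.

    Assumes records arrive in timestamp order. As a simplification,
    intervals that contain no records produce no batch at all.
    """
    for _, group in groupby(records, key=lambda r: int(r[0] // interval)):
        yield list(group)


stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

# Three batches: two records, two records, then the one left over.
count_batches = list(batches_by_count(stream, 2))

# Three batches, one per 1-second interval: [a, b], [c], [d, e].
time_batches = list(batches_by_time(stream, 1.0))
```

In this sketch each yielded list plays the role that an RDD plays in a DStream: a bounded chunk of the stream that can be processed as an ordinary batch.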
A DStream is defined by its input source ...