Now that the infrastructure is in place to collect, process, store, and deliver streaming data, the time has come to put it all together. The core of most applications is the aggregation of data coming through the stream, which is the subject of this chapter.
The place to begin is basic aggregation—counting and summation of various elements. This basic task has varying levels of native support in the popular stream processing frameworks. Probably the most advanced is the support provided by the Trident language portion of Storm, but that assumes a willingness to work within Trident's assumptions. Samza has some support for internal aggregation using its
KeyValueStore interfaces, but it leaves external aggregation and delivery to the user to implement. Basic Storm topologies, of course, have no primitives to speak of and require the user to implement all aspects of the topology. This chapter covers aggregation in all three frameworks using a common example.
Most real-time analysis is somehow related to time-series analysis. This chapter also introduces a method for performing multi-resolution aggregation of time series for later delivery. Additionally, because the data are being transported via mechanisms that can potentially reorder events, this multi-resolution approach relies on record-level timestamps rather than simply generating results on a fixed interval.
Finally, after data has been collected and aggregated into a time series, ...