Chapter 9: Statistical Approximation of Streaming Data

Many elementary properties of data streams can be obtained through basic counting methods. Totals, averages, minimums, maximums, and, to some extent, other order statistics can be computed with O(1) updates and O(1) storage space. Most systems stop at these elementary values because low-latency requirements and potentially unbounded storage needs make computing more complicated values prohibitively expensive.
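As an illustration of the counting methods described above, the following is a minimal Python sketch, not code from the book. The class name `RunningSummary` and its fields are hypothetical names chosen for this example. Each update touches a fixed number of fields, so both the update cost and the storage are O(1) no matter how many values the stream delivers. Order statistics other than the minimum and maximum are deliberately left out, since they generally cannot be tracked exactly this way.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Hypothetical O(1)-space summary of a numeric stream."""

    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        # Constant work per element: one increment, one addition,
        # and two comparisons.
        self.count += 1
        self.total += value
        if value < self.minimum:
            self.minimum = value
        if value > self.maximum:
            self.maximum = value

    @property
    def mean(self) -> float:
        # The average is derived from the two running counters;
        # it is undefined (NaN) before any value has been seen.
        return self.total / self.count if self.count else math.nan


# Usage: feed values one at a time, as they would arrive from a stream.
summary = RunningSummary()
for value in [3.0, 1.0, 4.0, 1.0, 5.0]:
    summary.update(value)
```

After these five updates, `summary` holds the count, total, minimum, and maximum of the values seen so far without retaining any of the individual values.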

Chapters 9 and 10 tackle this problem from a statistical perspective. Statistics is, essentially, a field developed to deal with situations in which a census of the entire population is too costly or time consuming. Instead, statistics provides a toolkit, built on the mathematics of probability, that allows a sample to be used to make inferences about the population. This chapter provides a brief introduction to statistical methods and concepts, including a foundation in probability and statistics that can be used to answer questions about the data rather than simply present tabulated results.

The techniques in this chapter are not specific to the analysis of streaming data. They are just as applicable to finite datasets of any size, and, with some modifications, they can also be applied to data streams. Methods for efficiently sampling from streams of data are discussed later in this chapter. Statistical analysis can be applied to these ...

