Computing the median in a large dataset
As you saw in the first recipe, computing the median exactly requires having all the values available, because the middle value depends on the ordering of the whole dataset. A mean, by contrast, needs only an accumulator and a counter. The fundamental point of this recipe is to introduce the idea of approximate computing: with big data, getting the precise value may not always be the best strategy (of course, this should be evaluated on a case-by-case basis).
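To make the contrast concrete, here is a minimal sketch using only the Python standard library. The function names `streaming_mean` and `exact_median` are illustrative and are not taken from the first recipe. The mean consumes values one at a time and keeps only two numbers, whereas the median has to materialize every value before it can answer.

```python
import statistics


def streaming_mean(values):
    """Mean in constant memory: one accumulator and one counter."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        raise ValueError("cannot compute the mean of an empty stream")
    return total / count


def exact_median(values):
    """Exact median: every value must be held in memory at once."""
    return statistics.median(list(values))


data = [5, 1, 4, 2, 3]
print(streaming_mean(iter(data)))
print(exact_median(iter(data)))
```

With a dataset that does not fit in memory, the first function still works on a stream, while the second does not.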
This recipe requires the first recipe to have been fully run.
Here, we will use two different strategies to compute the median: approximating the data points in a way that allows the data to be compressed, and subsampling the data.
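The two strategies can be sketched as follows. This is an illustration under stated assumptions, not necessarily the code used in this recipe: the compression is assumed to be rounding each value to a bin of width `bin_width` and keeping only per-bin counts, and the subsampling is assumed to be reservoir sampling. The names `binned_median` and `subsampled_median` are hypothetical.

```python
import random
import statistics
from collections import Counter


def binned_median(values, bin_width):
    """Approximate (lower) median from per-bin counts.

    Assumption: each value is rounded to the nearest multiple of
    bin_width, so memory grows with the number of distinct bins,
    not with the number of values.
    """
    counts = Counter()
    n = 0
    for v in values:
        counts[round(v / bin_width)] += 1
        n += 1
    if n == 0:
        raise ValueError("cannot compute the median of an empty stream")
    target = (n + 1) // 2  # rank of the lower median
    seen = 0
    for b in sorted(counts):
        seen += counts[b]
        if seen >= target:
            return b * bin_width


def subsampled_median(values, sample_size, seed=None):
    """Median of a uniform random sample drawn by reservoir sampling."""
    rng = random.Random(seed)
    sample = []
    for i, v in enumerate(values):
        if i < sample_size:
            sample.append(v)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                sample[j] = v  # replace with probability sample_size / (i + 1)
    return statistics.median(sample)


data = [random.Random(0).gauss(50, 10) for _ in range(1)]  # placeholder seed value
rng = random.Random(0)
data = [rng.gauss(50, 10) for _ in range(100_000)]
print(statistics.median(data))           # exact, needs all values
print(binned_median(data, 0.5))          # approximate, stores bin counts only
print(subsampled_median(data, 1000, 1))  # approximate, stores 1000 values
```

Both approximations trade accuracy for memory: a smaller `bin_width` or a larger `sample_size` brings the result closer to the exact median at the cost of storing more.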
As usual, this is available in the