Computing the median in a large dataset

As you saw in the first recipe, computing the median requires having all the values available. A mean, by contrast, needs only an accumulator and a counter. The main point of this recipe is to introduce the idea of approximate computing: with big data, computing the precise value is not always the best strategy (of course, this should be evaluated on a case-by-case basis).
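To make the contrast concrete, here is a minimal sketch using only the Python standard library. The function names `streaming_mean` and `exact_median` are hypothetical and are not taken from the first recipe; they only illustrate that a mean can be computed in one pass with constant memory, while an exact median needs the values to be held (or sorted) first.

```python
import statistics


def streaming_mean(values):
    """Mean in a single pass: only an accumulator and a counter are kept."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        raise ValueError("no values supplied")
    return total / count


def exact_median(values):
    """Exact median: every value is materialized in memory before sorting."""
    return statistics.median(list(values))
```

Both functions accept any iterable, including a generator that reads values lazily from disk. Only `streaming_mean` keeps its memory use constant as the dataset grows.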

Getting ready

This recipe requires the first recipe to have been run in full.

Here, we will use two different strategies to compute the median: approximating the data points in a way that allows the data to be compressed, and subsampling the data.

As usual, this is available in the 08_Advanced/Median.ipynb notebook.
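As a rough orientation, the two strategies named above could be sketched as follows. This is an illustrative sketch under stated assumptions, not the notebook's code: the names `binned_median` and `subsampled_median`, the `bin_width` parameter, and the use of reservoir sampling for the subsample are all choices made here for illustration.

```python
import random
import statistics
from collections import Counter


def binned_median(values, bin_width):
    """Approximate median: values are compressed into bin counts.

    Assumption: rounding each value down to a bin of width `bin_width`
    is an acceptable loss of precision for the data at hand.
    """
    counts = Counter()
    total = 0
    for v in values:
        counts[int(v // bin_width)] += 1
        total += 1
    if total == 0:
        raise ValueError("no values supplied")
    half = (total + 1) // 2
    seen = 0
    # Walk the bins in order until half of the observations are covered.
    for b in sorted(counts):
        seen += counts[b]
        if seen >= half:
            return (b + 0.5) * bin_width  # report the bin midpoint


def subsampled_median(values, sample_size, seed=None):
    """Approximate median: exact median of a uniform random subsample.

    Uses reservoir sampling so the stream is read once and only
    `sample_size` values are ever held in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, v in enumerate(values):
        if i < sample_size:
            reservoir.append(v)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                reservoir[j] = v
    return statistics.median(reservoir)
```

In the binned version, memory grows with the number of distinct bins rather than the number of values, and the result lies within one `bin_width` of the true median. In the subsampled version, memory is fixed by `sample_size`, and the error shrinks as the sample grows.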

How to do it...

