Chapter 7. Aggregations: min, max, and Everything in Between
A first step in exploring any dataset is often to compute various summary statistics. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the “typical” values in a dataset, but other aggregations are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).
NumPy has fast built-in aggregation functions for working on arrays; we’ll discuss and try out some of them here.
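The aggregations listed above each correspond to a NumPy function. The following is a minimal sketch rather than an example from the text: the array `values` and the dictionary `summary` are illustrative names, and the data is chosen so the results are easy to check by hand.

```python
import numpy as np

# Hypothetical sample data (not from the text), small enough to verify by hand.
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

summary = {
    "sum": np.sum(values),
    "product": np.prod(values),
    "mean": np.mean(values),
    "std": np.std(values),  # population standard deviation (ddof=0 by default)
    "median": np.median(values),
    "min": np.min(values),
    "max": np.max(values),
    "quartiles": np.quantile(values, [0.25, 0.5, 0.75]),
}

for name, result in summary.items():
    print(f"{name:>9}: {result}")
```

Note that the mean (4.0) is pulled above the median (3.0) by the single large value, which is one reason to look at more than one summary statistic.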
Summing the Values in an Array
As a quick example, consider computing the sum of all values in an array. Python itself can do this using the built-in sum function:
In [1]: import numpy as np
        rng = np.random.default_rng()
In [2]: L = rng.random(100)
        sum(L)
Out[2]: 52.76825337322368
The syntax is quite similar to that of NumPy’s sum function, and the result is the same in the simplest case:
In [3]: np.sum(L)
Out[3]: 52.76825337322366
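The two outputs agree except in the last digit or so. Floating-point addition is not associative, and the two functions need not add the values in the same order (NumPy typically groups the additions pairwise), so tiny rounding differences are expected. As a hedged sketch, the comparison below uses a tolerance rather than exact equality; the fixed seed and the names `builtin_total` and `numpy_total` are assumptions added here for reproducibility and are not part of the text.

```python
import numpy as np

# The seed is an assumption, used only so that reruns give the same data.
rng = np.random.default_rng(42)
L = rng.random(100)

builtin_total = sum(L)    # left-to-right Python-level addition
numpy_total = np.sum(L)   # compiled reduction; may group additions differently

print(builtin_total, numpy_total)

# Compare floating-point results with a tolerance, not with ==.
print(np.isclose(builtin_total, numpy_total))
```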
However, because it executes the operation in compiled code, NumPy’s version of the operation is computed much more quickly:
In [4]: big_array = rng.random(1000000)
        %timeit sum(big_array)
        %timeit np.sum(big_array)
Out[4]: 89.9 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
        521 µs ± 8.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
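The %timeit command is an IPython magic and is not available in a plain Python script. A rough equivalent can be sketched with the standard library's timeit module. The smaller array size, the repetition count, and the variable names below are assumptions chosen to keep the run short; absolute timings will vary by machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng()
big_array = rng.random(100_000)  # smaller than the text's array, to keep the run short

repeats = 5  # assumed repetition count; %timeit picks its own automatically

# timeit.timeit returns the total time for `number` calls, so divide for a per-call average.
builtin_seconds = timeit.timeit(lambda: sum(big_array), number=repeats) / repeats
numpy_seconds = timeit.timeit(lambda: np.sum(big_array), number=repeats) / repeats

print(f"built-in sum: {builtin_seconds * 1e3:.3f} ms per call")
print(f"np.sum:       {numpy_seconds * 1e3:.3f} ms per call")
print(f"speedup:      {builtin_seconds / numpy_seconds:.0f}x")
```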
Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to confusion! In particular, their optional arguments have different meanings (sum(x, ...
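One such difference can be sketched on a small two-dimensional array: the second positional argument of the built-in sum is a start value, whereas the second positional argument of np.sum is an axis. The array `x` and the result names below are illustrative and not taken from the text.

```python
import numpy as np

# Hypothetical 2x2 array used only for illustration.
x = np.array([[1, 2],
              [3, 4]])

# Built-in sum: the second argument is the START value.
# It iterates over the rows: 1 + [1, 2] + [3, 4] -> [5, 7]
builtin_result = sum(x, 1)

# np.sum: the second argument is the AXIS to sum along.
# Summing along axis 1 adds within each row -> [3, 7]
numpy_result = np.sum(x, 1)

print(builtin_result)  # [5 7]
print(numpy_result)    # [3 7]

# With no extra argument, np.sum adds every element of the array.
print(np.sum(x))       # 10
```

Writing the keyword explicitly, as in np.sum(x, axis=1), makes the intent clear and avoids mixing up the two meanings.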