Chapter 24. High-Performance Pandas: eval and query
As we’ve already seen in previous chapters, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into lower-level compiled code via an intuitive higher-level syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas. While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.
To address this, Pandas includes some methods that allow you to directly access C-speed operations without costly allocation of intermediate arrays: eval and query, which rely on the NumExpr package. In this chapter I will walk you through their use and give some rules of thumb about when you might think about using them.
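As a rough sketch of what these two tools look like in use, the example below compares them with their ordinary Pandas equivalents. The frame names df1 and df2, the column names, and the sizes are placeholders chosen for illustration, not taken from the text. If NumExpr is not installed, Pandas falls back to its plain Python engine, so the results should match but the speed benefit may not appear.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Two small illustrative frames (names, columns, and sizes are assumptions)
df1 = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])
df2 = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

# Ordinary Pandas arithmetic
plain_sum = df1 + df2

# pd.eval takes the same expression as a string
eval_sum = pd.eval("df1 + df2")

# Ordinary Boolean-mask filtering builds two masks, then combines them
plain_rows = df1[(df1["A"] < 0.5) & (df1["B"] > 0.5)]

# DataFrame.query expresses the same filter as a string over column names
query_rows = df1.query("A < 0.5 and B > 0.5")

# Both routes should agree
same_sum = np.allclose(plain_sum, eval_sum)
same_rows = plain_rows.equals(query_rows)
print(same_sum, same_rows)
```

On frames this small the string-based versions are unlikely to be faster; the point here is only the syntax and the equivalence of results.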
Motivating query and eval: Compound Expressions
We’ve seen previously that NumPy and Pandas support fast vectorized operations; for example, when adding the elements of two arrays:
In [1]: import numpy as np
        rng = np.random.default_rng(42)
        x = rng.random(1000000)
        y = rng.random(1000000)
        %timeit x + y
Out[1]: 2.21 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As discussed in Chapter 6, this is much faster than doing the addition via a Python loop or comprehension:
In [2]: %timeit np.fromiter((xi + yi for xi, yi in zip(x, y)),
                            dtype=x.dtype, count=len(x))
Out[2]: 263 ms
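Because %timeit is an IPython magic, the same comparison can be sketched outside a notebook with the standard-library timeit module. This is only an illustrative script: the array size is reduced (an assumption made to keep the run short), the helper names vectorized_add and loop_add are hypothetical, and absolute timings will vary by machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # smaller than the 1,000,000 used above, to keep the run short
x = rng.random(n)
y = rng.random(n)


def vectorized_add():
    # Element-wise addition carried out in compiled NumPy code
    return x + y


def loop_add():
    # The same addition, one element at a time, via a Python generator
    return np.fromiter((xi + yi for xi, yi in zip(x, y)),
                       dtype=x.dtype, count=len(x))


# Both approaches should produce identical values
results_match = np.array_equal(vectorized_add(), loop_add())

# Best of three repeats, reported as seconds per call
t_vec = min(timeit.repeat(vectorized_add, number=10, repeat=3)) / 10
t_loop = min(timeit.repeat(loop_add, number=1, repeat=3))

print(f"vectorized: {t_vec * 1e3:.3f} ms, loop: {t_loop * 1e3:.3f} ms")
```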