Chapter 24. High-Performance Pandas: eval and query
As we’ve already seen in previous chapters, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into lower-level compiled code via an intuitive higher-level syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas. While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.
To address this, Pandas includes some methods that allow you to directly access C-speed operations without costly allocation of intermediate arrays: eval and query, which rely on the NumExpr package. In this chapter I will walk you through their use and give some rules of thumb about when you might think about using them.
Motivating query and eval: Compound Expressions
We’ve seen previously that NumPy and Pandas support fast vectorized operations; for example, when adding the elements of two arrays:
In [1]: import numpy as np
        rng = np.random.default_rng(42)
        x = rng.random(1000000)
        y = rng.random(1000000)
        %timeit x + y
Out[1]: 2.21 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As discussed in Chapter 6, this is much faster than doing the addition via a Python loop or comprehension:
In [2]: %timeit np.fromiter((xi + yi for xi, yi in zip(x, y)),
                            dtype=x.dtype, count=len(x))
Out[2]: 263 ms
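The %timeit magic used above is only available inside IPython or Jupyter. As a rough sketch of how the same comparison might be run from a plain Python script, the following uses the standard library's timeit module instead. The function names add_vectorized and add_python_loop are illustrative helpers, not part of NumPy, and the array length is reduced from the 1,000,000 used above (an assumption made only to keep the run short). Timings will vary by machine, so none are shown.

```python
import timeit

import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # assumption: smaller than the 1,000,000 in the text, for a quick run
x = rng.random(n)
y = rng.random(n)


def add_vectorized():
    # Elementwise addition performed in NumPy's compiled code
    return x + y


def add_python_loop():
    # The same addition, driven element by element from a Python generator
    return np.fromiter((xi + yi for xi, yi in zip(x, y)),
                       dtype=x.dtype, count=len(x))


# Both approaches should produce the same array
same_result = np.array_equal(add_vectorized(), add_python_loop())

# Take the best of three repeats; the vectorized call is averaged over 10 runs
t_vec = min(timeit.repeat(add_vectorized, number=10, repeat=3)) / 10
t_loop = min(timeit.repeat(add_python_loop, number=1, repeat=3))

print(f"results match: {same_result}")
print(f"vectorized:  {t_vec * 1e3:.3f} ms per call")
print(f"python loop: {t_loop * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats is a common convention for reducing noise from other processes; %timeit instead reports a mean and standard deviation, so the numbers are not directly comparable to the outputs shown above.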