Chapter 4. Cython in Practice: N-Body Simulation
The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.
— F. Brooks
This chapter applies the Cython fundamentals discussed in Chapter 3 to a straightforward but nontrivial example using what we have covered so far. The example starts with a pure-Python N-body simulator to model the solar system, and converts the performance-critical components to use Cython constructs. It comes from the widely known computer language benchmarks game, allowing comparison between the pure-Python, Cython, and C implementations of the same program.
This chapter will give us a better understanding of how Cython is used in practice. The pure-C, pure-Python, and converted Cython versions can be found in the example code repository. Interested readers can follow along with the entire example using this resource.
Overview of the N-Body Python Code
The Python N-body code evolves the positions and velocities of the four Jovian planets in a heliocentric coordinate system. Such a system is chaotic, meaning that the long-term evolution of the system is very sensitive to the initial positions and velocities of all bodies. Small perturbations in the initial conditions lead to arbitrarily diverging results, making prediction difficult. When we are simulating a chaotic system, it is important that the algorithm, or integrator, be highly accurate. For this reason the N-body code uses a symplectic integrator, which is a fancy term for a time-stepping scheme that does a really good job of computing the right trajectories.
The time step and the initial positions, velocities, and masses of the Jovian planets are given. By passing in a command-line argument, we can vary the number of time steps the integrator takes.
The main routine is straightforward. It takes the number of steps to
integrate (n
) the initial conditions of the celestial bodies to integrate,
and a reference body (in this case, the Sun):
def
main
(
n
,
bodies
=
BODIES
,
ref
=
'sun'
):
# ...
It first gets a list of all the bodies and makes pairs of all of them for convenience, as many functions need to iterate over all unique pairs:
# ...
system
=
list
(
bodies
.
values
())
pairs
=
combinations
(
system
)
It then calls offset_momentum
to correct the Sun’s momentum so that it stays
at the system’s center of mass:
# ...
offset_momentum
(
bodies
[
ref
],
system
)
Before running the integrator, main
first calls report_energy
to compute
and print the system’s total energy:
# ...
report_energy
(
system
,
pairs
)
Symplectic integrators are very good at conserving energy, and we will use energy conservation as a way to test the accuracy of the integrator.
After getting the initial energy, we then call advance
, the core of the
computation, and pass in the time step, the number of steps to take, and the
sequence of paired bodies:
# ...
advance
(
0.01
,
n
,
system
,
pairs
)
For this simulation, the unit of time is the mean solar day, the unit of distance is one astronomical unit, and the unit of mass is the solar mass.
After advancing the system, we output the total energy again:
# ...
report_energy
(
system
,
pairs
)
Its value should be close to the total energy computed before advance
was called.
Let’s try it out from the command line:
$ time python nbody.py 500000 -0.169075164 -0.169096567 python nbody.py 500000 13.21s user 0.04s system 99% cpu 13.286 total
The energy before and after match to nearly five decimal places.
This pure-Python version requires about 13 seconds to advance 500,000 steps. When all is said and done, Cython will improve performance by nearly two orders of magnitude, approaching the performance of a pure-C version of the same algorithm.
Converting to Cython
Let’s first run our pure-Python version under cProfile
to quantify where the
runtime is spent:
$ ipython --no-banner In [1]: %run -p nbody.py 500000 71 function calls in 13.897 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 13.880 13.880 13.896 13.896 nbody.py:59(advance) 2 0.015 0.008 0.015 0.008 {range} 1 0.001 0.001 13.897 13.897 {execfile} 2 0.000 0.000 0.000 0.000 nbody.py:82(report_energy) ...
It is not surprising to find that advance
consumes 99.9 percent of the runtime.
Converting it to use static types and more efficient data structures is the
right approach. The rest of the code can remain as is.
Before we begin converting our code to Cython, we first copy the nbody.py file to nbody.pyx, which allows us to use Cython-specific declarations and constructs.
Let’s compile and run the Cython version to ensure the program works correctly.
To compile, we use a simple distuils
script named setup.py:
from
distutils.core
import
setup
from
Cython.Build
import
cythonize
setup
(
name
=
"nbody"
,
ext_modules
=
cythonize
(
"nbody.pyx"
))
We need a run_nbody.py driver script to run the main
function inside our
nbody
extension module:
import
sys
from
nbody
import
main
main
(
int
(
sys
.
argv
[
1
]))
Building our extension is straightforward:
$ python setup.py build_ext -i
(Consult Chapter 2 for platform-specific compilation instructions.)
After compiling our extension, we can test that we obtain the same results as before:
$ time python run_nbody.py 500000 -0.169075164 -0.169096567 python run_nbody.py 500000 4.78s user 0.03s system 99% cpu 4.821 total
The output is identical to the pure-Python version’s, and the performance already improved by a factor of 2.8. Cython provides this performance improvement essentially for free.
With our compilation infrastructure in place, we can turn our attention to improving performance further still.
Python Data Structures and Organization
In Python, each celestial body is represented as a tuple with three elements: two three-element lists for the position and velocity, and a float value for the mass. For example, the Sun’s initial condition is represented by the following three-element tuple:
([
0.0
,
0.0
,
0.0
],
# position
[
0.0
,
0.0
,
0.0
],
# velocity
SOLAR_MASS
# mass
)
And Jupiter’s is:
([
4.84143144246472090e+00
,
-
1.16032004402742839e+00
,
-
1.03622044471123109e-01
],
[
1.66007664274403694e-03
*
DAYS_PER_YEAR
,
7.69901118419740425e-03
*
DAYS_PER_YEAR
,
-
6.90460016972063023e-05
*
DAYS_PER_YEAR
],
9.54791938424326609e-04
*
SOLAR_MASS
),
The global constants DAYS_PER_YEAR
and SOLAR_MASS
are defined normalization
parameters.
The system
variable is a list of these tuples, and pairs
is a list of all
pairs of these tuples. The simulation will access and update the positions and
velocities of all planets frequently, so optimizing their representation is
essential.
The advance
function loops over all steps, and for each step, loops over all
pairs of bodies:
def
advance
(
dt
,
n
,
bodies
,
pairs
):
for
i
in
range
(
n
):
for
(([
x1
,
y1
,
z1
],
v1
,
m1
),
([
x2
,
y2
,
z2
],
v2
,
m2
))
in
pairs
:
# ...update velocities...
Here we use tuple unpacking to extract the positions (x1
, x2
, y1
, y2
,
etc.), the velocity lists v1
and v2
, and the masses m1
and m2
from each
pair in pairs
. The body of the loop updates the velocities according to the
symplectic integration algorithm.
Once the velocities are updated, we update the positions:
for
(
r
,
[
vx
,
vy
,
vz
],
m
)
in
bodies
:
r
[
0
]
+=
dt
*
vx
r
[
1
]
+=
dt
*
vy
r
[
2
]
+=
dt
*
vz
The bodies
and pairs
sequences are set up to refer to the same
objects, so updating the velocities in the first loop allows us to update the
positions in the second, even though we are looping over different sequences.
Converting Data Structures to structs
Our strategy to improve performance is to convert the pure-Python
list-of-tuples-of-lists-of-floats into a C array of C struct
s. With the
C version, accessing and updating the planet’s data will have much better
performance, as these operations will use fast C iteration and optimized
lookups, rather than the general (and slow) iteration and lookups we know to
expect from the Python interpreter.
Let’s define a struct, body_t
, that has two double
arrays for the body’s
position and velocity, and a single double
for its mass:
cdef
struct
body_t
:
double
x
[
3
]
double
v
[
3
]
double
m
We place this struct definition toward the top of nbody.pyx.
Another goal is to leave most of the nbody.py code unmodified, and use
our body_t
struct only where performance matters.
The advance
function needs to convert the Python list of tuples of celestial
body data into a C array of body_t
elements. Let’s make a cdef
function
pair to convert between Python and C data types.
First, make_cbodies
converts a Python list of tuples into a C array of
body_t
structs. It takes a bodies
Python list and a preallocated C array
of body_t
s:
cdef
void
make_cbodies
(
list
bodies
,
body_t
*
cbodies
)
The implementation simply loops over the bodies
list and initializes the
preallocated cbodies
array with the Python list’s data:
cdef
void
make_cbodies
(
list
bodies
,
body_t
*
cbodies
,
int
num_cbodies
):
cdef
body_t
*cbody
for
i
,
body
in
enumerate
(
bodies
):
if
i
>=
num_cbodies
:
break
(
x
,
v
,
m
)
=
body
cbody
=
&
cbodies
[
i
]
cbody
.
x
[
0
],
cbody
.
x
[
1
],
cbody
.
x
[
2
]
=
x
cbody
.
v
[
0
],
cbody
.
v
[
1
],
cbody
.
v
[
2
]
=
v
cbodies
[
i
]
.
m
=
m
Its complement, make_pybodies
, converts a body_t
array into a Python list
of tuples:
cdef
list
make_pybodies
(
body_t
*
cbodies
,
int
num_cbodies
):
pybodies
=
[]
for
i
in
range
(
num_cbodies
):
x
=
[
cbodies
[
i
]
.
x
[
0
],
cbodies
[
i
]
.
x
[
1
],
cbodies
[
i
]
.
x
[
2
]]
v
=
[
cbodies
[
i
]
.
v
[
0
],
cbodies
[
i
]
.
v
[
1
],
cbodies
[
i
]
.
v
[
2
]]
pybodies
.
append
((
x
,
v
,
cbodies
[
i
]
.
m
))
return
pybodies
Now we are ready to convert the for
loops in advance
to use static types.
First, consider the original loop body:
def
advance
(
dt
,
n
,
bodies
,
pairs
):
# ...
for
(([
x1
,
y1
,
z1
],
v1
,
m1
),
([
x2
,
y2
,
z2
],
v2
,
m2
))
in
pairs
:
dx
=
x1
-
x2
dy
=
y1
-
y2
dz
=
z1
-
z2
mag
=
dt
*
((
dx
*
dx
+
dy
*
dy
+
dz
*
dz
)
**
(
-
1.5
))
b1m
=
m1
*
mag
b2m
=
m2
*
mag
v1
[
0
]
-=
dx
*
b2m
v1
[
1
]
-=
dy
*
b2m
v1
[
2
]
-=
dz
*
b2m
v2
[
0
]
+=
dx
*
b1m
v2
[
1
]
+=
dy
*
b1m
v2
[
2
]
+=
dz
*
b1m
The Cython version is as follows:
def
advance
(
double
dt
,
int
n
,
bodies
):
cdef
:
int
i
,
ii
,
jj
double
dx
,
dy
,
dz
,
mag
,
b1m
,
b2m
body_t
*
body1
body_t
*
body2
body_t
cbodies
[
NBODIES
]
make_cbodies
(
bodies
,
cbodies
,
NBODIES
)
for
i
in
range
(
n
):
for
ii
in
range
(
NBODIES
-
1
):
body1
=
&
cbodies
[
ii
]
for
jj
in
range
(
ii
+
1
,
NBODIES
):
body2
=
&
cbodies
[
jj
]
dx
=
body1
.
x
[
0
]
-
body2
.
x
[
0
]
dy
=
body1
.
x
[
1
]
-
body2
.
x
[
1
]
dz
=
body1
.
x
[
2
]
-
body2
.
x
[
2
]
mag
=
dt
*
((
dx
*
dx
+
dy
*
dy
+
dz
*
dz
)
**
(
-
1.5
))
b1m
=
body1
.
m
*
mag
b2m
=
body2
.
m
*
mag
body1
.
v
[
0
]
-=
dx
*
b2m
body1
.
v
[
1
]
-=
dy
*
b2m
body1
.
v
[
2
]
-=
dz
*
b2m
body2
.
v
[
0
]
+=
dx
*
b1m
body2
.
v
[
1
]
+=
dy
*
b1m
body2
.
v
[
2
]
+=
dz
*
b1m
for
ii
in
range
(
NBODIES
):
body2
=
&
cbodies
[
ii
]
body2
.
x
[
0
]
+=
dt
*
body2
.
v
[
0
]
body2
.
x
[
1
]
+=
dt
*
body2
.
v
[
1
]
body2
.
x
[
2
]
+=
dt
*
body2
.
v
[
2
]
return
make_pybodies
(
cbodies
,
NBODIES
)
We convert the for
loop over pairs
into nested for
loops over indices
into the C array of body_t
structs. We use two body_t
pointers to refer to
the current bodies in the pair.
We removed the pairs
argument to advance
, so we need to update main
to
reflect this change, but we will not show the modification here.
Running the Cythonized Version
After recompiling our code, we can run our latest Cython version and see how it compares to the Python version:
$ time python run_nbody.py 500000 -0.169075164 -0.169096567 python run_nbody.py 500000 0.54s user 0.01s system 99% cpu 0.550 total
Our Cython version takes about 0.4 seconds to run, and the energy values are in agreement. This is about 25 times faster than the pure Python version.
We can compare this to the runtime of a serial hand-written C version obtained from the computer language benchmarks game, which we compile with equivalent optimization flags:
$ time ./nbody.x 500000 -0.169075164 -0.169096567 ./nbody.x 500000 0.14s user 0.00s system 97% cpu 0.150 total
Our performance thus far is within a factor of four of the C version.
A quick comparison of the C version’s advance
function and our version
reveals one important difference when the distance is computed—the C version
uses sqrt
:
double
inv_distance
=
1.0
/
sqrt
(
dx
*
dx
+
dy
*
dy
+
dz
*
dz
);
double
mag
=
inv_distance
*
inv_distance
*
inv_distance
;
while our version uses the **
operator, which Cython translates to pow
:
mag
=
dt
*
((
dx
*
dx
+
dy
*
dy
+
dz
*
dz
)
**
(
-
1.5
))
It is straightforward to convert our version to use sqrt
:
ds
=
dx
*
dx
+
dy
*
dy
+
dz
*
dz
mag
=
dt
/
(
ds
*
sqrt
(
ds
))
This requires that we type ds
as a double
and add a cimport
line at the
top of the file (Chapter 6):
from
libc.math
cimport
sqrt
With this minor syntactic change, we see another significant performance boost:
$ time python ./run_nbody.py 500000 -0.169075164 -0.169096567 python ./run_nbody.py 500000 0.15s user 0.01s system 99% cpu 0.159 total
This last improvement yields code that is a factor of 3.6 faster than the previous version, is a factor of 90 faster than the pure-Python version, and brings us within a factor of 1.25 of the pure-C version’s performance.
Summary
This chapter demonstrates how to take numeric-heavy Python code and convert it to Cython, achieving a factor-of-90 boost in performance. The approach we used is straightforward and ensures that we get the most payoff for our efforts.
The steps we followed are:
-
Profile the pure-Python version (using the
cProfile
module or IPython’s%run -p
magic command) to determine where the code spends its time. In this example, nearly all the runtime is spent in the loop-heavyadvance
function. -
Inspect the hotspots for nested
for
loops, numeric-heavy operations, and nested Python containers, all of which can be easily converted with Cython to use more efficient C-level constructs. This example happens to have all of the above. -
Use Cython to declare C data structures equivalent to the Python data
structures identified above. Create converters (if necessary) to transform
Python data to C data. In the N-body simulation, we created a
body_t
struct to represent the nested list-of-tuples-of-lists-of-floats Python data in C, which has better data locality and significantly more efficient access. We also created two converters,make_cbodies
andmake_pybodies
, to convert Python to C and C to Python, respectively. Sometimes these converters are not necessary if Cython can convert the data automatically. -
Convert the hotspots to use our C-level data structures. Remove Python data
structures from nested loops to the extent possible. Ensure all variables
used in nested loops (including the loop variables themselves) are statically
typed. Our
make_pybodies
andmake_cbodies
converters, coupled with plenty ofcdef
declarations, were sufficient in this example. - Test the code to ensure the modifications have not changed the semantics. Profile again. If performance is not satisfactory, use Cython profiling tools (Chapter 9) to draw attention to inefficient code.
- Repeat as necessary.
Another goal of this chapter was to show how to use the components covered in Chapter 3 in a realistic setting. Remembering the Pareto principle (or the 80/20 rule) is useful: we need only use Cython in the 20 percent of the code that occupies 80 percent (or more) of the runtime. The other 80 percent of the code can (and should) remain unmodified.
Studying this example end-to-end is a good exercise for the Cython newcomer; understanding it fully will solidify many core concepts and techniques useful for any Cython project.
Get Cython now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.