Chapter 1. Cython Essentials
The test of a first-rate intelligence is the ability to hold two opposed ideas in mind at the same time and still retain the ability to function.
— F. Scott Fitzgerald
Cython is two closely related things:

- the Cython programming language, which blends Python with the static type system of C and C++;
- the cython compiler, which translates Cython source code into efficient C or C++ source code.

Cython’s power comes from the way it combines Python and C: it feels like Python while providing easy access to C. Cython is situated between high-level Python and low-level C; one might call it a creole programming language.
But Python and C-like languages are so different—why combine them? Precisely because their differences are complementary. Python is high-level, dynamic, easy to learn, and flexible. These positives come with a cost, however—because Python is dynamic and interpreted, it can be several orders of magnitude slower than statically typed compiled languages.
C, on the other hand, is one of the oldest statically typed compiled languages in widespread use, so compilers have had nearly half a century to optimize its performance. C is very low level and very powerful. Unlike Python, it does not have many safeguards in place and can be difficult to use.
Both languages are mainstream, but they are typically used in different domains, given their differences. Cython’s beauty is this: it combines Python’s expressiveness and dynamism with C’s bare-metal performance while still feeling like Python.
With very few exceptions, Python code (both versions 2.x and 3.x) is already valid Cython code. Cython adds a small number of keywords to the Python language to tap into C’s type system, allowing the cython compiler to generate efficient C code. If you already know Python and have a basic understanding of C or C++, you will be able to learn Cython quickly. You do not have to learn yet another interface language.
We can think of Cython as two projects in one. If compiling Python to C is Cython’s yin, then interfacing C or C++ with Python is its yang. We can start with Python code that needs better performance, or we can start with C (or C++) code that needs an optimized Python interface. To speed up Python code, Cython compiles Python source with optional static type declarations to achieve massive performance improvements, depending on the algorithm. To interface C or C++ libraries with Python, we can use Cython to interface with external code and create optimized wrappers. Both capabilities—compiling Python and interfacing with external code—are designed to work together well, and each is an essential part of what makes Cython useful. With Cython, we can move in either direction, coming from either starting point.
Let’s see an example.
Comparing Python, C, and Cython
Consider a simple Python function `fib` that computes the nth Fibonacci number:[1]

```python
def fib(n):
    a, b = 0.0, 1.0
    for i in range(n):
        a, b = a + b, a
    return a
```
As mentioned in the introduction, this Python function is already a valid Cython function, and it has identical behavior in both Python and Cython. We will see shortly how we can add Cython-specific syntax to `fib` to improve its performance.

The C transliteration of `fib` follows the Python version closely:

```c
double cfib(int n)
{
    int i;
    double a = 0.0, b = 1.0, tmp;
    for (i = 0; i < n; ++i) {
        tmp = a;
        a = a + b;
        b = tmp;
    }
    return a;
}
```
We use `double`s in the C version and `float`s in the Python version to make the comparison direct and to remove any issues related to integer overflow for C integral data types.
Imagine blending the types from the C version with the code from the Python version. The result is a statically typed Cython version:
```cython
def fib(int n):
    cdef int i
    cdef double a = 0.0, b = 1.0
    for i in range(n):
        a, b = a + b, a
    return a
```
As mentioned previously, Cython understands Python code, so our unmodified Python `fib` function is also valid Cython code. To convert the dynamically typed Python version to the statically typed Cython version, we use the `cdef` Cython statement to declare the statically typed C variables `i`, `a`, and `b`. Even for readers who haven’t seen Cython code before, it should be straightforward to understand what is going on.
What about performance? Table 1-1 has the results.
| Version | fib(0) [ns] | Speedup | fib(90) [ns] | Speedup | Loop body [ns] | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Pure Python | 590 | 1 | 12,852 | 1 | 12,262 | 1 |
| Pure C | 2 | 295 | 164 | 78 | 162 | 76 |
| C extension | 220 | 3 | 386 | 33 | 166 | 74 |
| Cython | 90 | 7 | 258 | 50 | 168 | 73 |
In Table 1-1,[2] the second column measures the runtime for `fib(0)`, and the third column measures the speedup of `fib(0)` relative to Python. Because the argument to `fib` controls the number of loop iterations, `fib(0)` does not enter the Fibonacci loop, so its runtime is a reasonable measure of the language-runtime and function-call overhead.

The fourth and fifth columns measure the runtime and speedup for `fib(90)`, which executes the loop 90 times. Both the call overhead and the loop execution contribute to its runtime.

The sixth and seventh columns measure the difference between the `fib(90)` runtime and the `fib(0)` runtime, and the relative speedup. This difference approximates the runtime of the loop alone, with the language-runtime and call overhead removed.
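The loop-body arithmetic in the last two columns is easy to reproduce for the pure-Python version with the standard-library timeit module. This is a sketch, not the book's actual harness, and absolute numbers will vary by machine:

```python
import timeit

def fib(n):
    a, b = 0.0, 1.0
    for i in range(n):
        a, b = a + b, a
    return a

def ns_per_call(stmt, number=20_000):
    # Best of five repeats, converted from seconds per batch to ns per call.
    best = min(timeit.repeat(stmt, globals=globals(), number=number, repeat=5))
    return best / number * 1e9

overhead = ns_per_call("fib(0)")   # call overhead only: the loop never runs
total = ns_per_call("fib(90)")     # call overhead plus 90 loop iterations
loop_only = total - overhead       # approximates the "Loop body" column
```

Here `loop_only` corresponds to the "Loop body" column of Table 1-1; dividing it by 90 gives the per-iteration cost discussed in the text.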
Table 1-1 has four rows:

- Pure Python: The first row (after the header) measures the performance of the pure-Python version of `fib`, and as expected, it has the poorest performance by a significant margin in all categories. In particular, the call overhead for `fib(0)` is over half a microsecond on this system. Each loop iteration in `fib(90)` requires nearly 150 nanoseconds; Python leaves much room for improvement.
- Pure C: The second row measures the performance of the pure-C version of `fib`. In this version there is no interaction with the Python runtime, so there is minimal call overhead; this also means it cannot be used from Python. This version provides a bound for the best performance we can reasonably expect from a simple serial `fib` function. The `fib(0)` value indicates that C function call overhead is minimal (2 nanoseconds) when compared to Python, and the `fib(90)` runtime (164 nanoseconds) is nearly 80 times faster than Python's on this particular system.
- Hand-written C extension: The third row measures a hand-written C extension module for Python 2. This extension module requires several dozen lines of C code, most of it boilerplate that calls the Python/C API. When called from Python, the extension module must convert Python objects to C data, compute the Fibonacci number in C, and convert the result back to a Python object. Its call overhead (the `fib(0)` column) is correspondingly larger than that of the pure-C version, which does not have to convert from and to Python objects. Because it is written in C, it is about three times faster than pure Python for `fib(0)`. It also gives a nice factor-of-30 speedup for `fib(90)`.
- Cython: The last row measures the performance of the Cython version. Like the C extension, it is usable from Python, so it must convert Python objects to C data before it can compute the Fibonacci number, and then convert the result back to Python. Because of this overhead, it cannot match the pure-C version for `fib(0)`, but, notably, it has about 2.5 times less overhead than the hand-written C extension. Because of this reduced call overhead, it provides a speedup of about a factor of 50 over pure Python for `fib(90)`.
The takeaways from Table 1-1 are the last two columns: the loop runtime for the pure C, C extension, and Cython versions are all about 165 nanoseconds on this system, and the speedups relative to Python are all approximately 75×.
Note
For the C-only parts of an algorithm—provided sufficient static type information is available—Cython can usually generate code that is as efficient as a pure-C equivalent.
So, when properly accounting for Python overhead, we see that Cython achieves C-level performance. Moreover, it does better than the hand-written C extension module on the Python-to-C conversions.
Note
Cython generates highly optimized code that is frequently faster than an equivalent hand-written C extension module. It is often able to generate Python-to-C conversion code that is several factors faster than naive calls into the Python/C API.
As we will learn in Chapter 3, we can go even further and use Cython to create Python-like C functions that have no Python overhead. These functions can be called from other Cython code but cannot be called directly from Python. They allow us to remove expensive call overhead for core computations.
What is the reason for Cython’s performance improvements? For this example, the likely causes are function call overhead, looping, math operations, and stack versus heap allocations.
Function Call Overhead
The `fib(0)` runtime is mostly consumed by the time it takes to call a function in the respective language; the time to run the function's body is relatively small. We see in Table 1-1 that Cython generates code that is nearly an order of magnitude faster than calling a Python function, and more than two times faster than the hand-written extension. Cython accomplishes this by generating highly optimized C code that bypasses some of the slower Python/C API calls; the preceding C-extension timings use those API calls directly.
Looping
Python `for` loops, as compared to loops in compiled languages, are notoriously slow. One surefire way to speed up loopy Python code is to move the Python `for` and `while` loops into compiled code, either by calling built-in functions or by using something like Cython to do the transformation for you. The `fib(90)` column in the table runs a `for` loop in each language for 90 iterations, and we see the impact of this operation on the different versions' runtimes.
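As a small illustration of this point (my example, not the book's benchmark), moving a summation loop into the built-in `sum` lets the iteration run in C rather than in the interpreter:

```python
import timeit

def explicit_sum(n):
    # Interpreted loop: every iteration is dispatched by the Python VM.
    total = 0
    for i in range(n):
        total += i
    return total

def builtin_sum(n):
    # The same loop runs inside sum()'s C implementation.
    return sum(range(n))

assert explicit_sum(10_000) == builtin_sum(10_000)

t_explicit = timeit.timeit(lambda: explicit_sum(10_000), number=200)
t_builtin = timeit.timeit(lambda: builtin_sum(10_000), number=200)
# On CPython, t_builtin is typically several times smaller than t_explicit.
```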
Math Operations
Because Python is dynamically typed and cannot make any type-based optimizations, an expression like `a + b` could do anything. We may know that `a` and `b` are only ever going to be floating-point numbers, but Python never makes that assumption. So, at runtime, Python has to look up the types of both `a` and `b` (which, in this instance, are the same). It must then find the type's underlying `__add__` method (or the equivalent), and call `__add__` with `a` and `b` as arguments. Inside this method, the Python `float`s `a` and `b` have to be unboxed to extract the underlying C `double`s, and only then can the actual addition occur! The result of this addition has to be packaged in an entirely new Python `float` and returned as the result.

The C and Cython versions already know that `a` and `b` are `double`s and can never be anything else, so adding `a` and `b` compiles to just one machine-code instruction.
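The lookup-dispatch-unbox sequence can be spelled out in pure Python. This sketch mirrors, in simplified form, what the interpreter does for `a + b` (the real binary-operator protocol has more cases, such as subclass priority):

```python
def dynamic_add(a, b):
    # 1. Look up the left operand's type at runtime.
    left_type = type(a)
    # 2. Find the type's __add__ slot and call it with both operands.
    result = left_type.__add__(a, b)
    # 3. If the left type cannot handle the right operand, Python falls
    #    back to the right operand's __radd__.
    if result is NotImplemented:
        result = type(b).__radd__(b, a)
    # 4. The returned value is a brand-new Python object.
    return result

assert dynamic_add(1.0, 2.0) == 3.0   # handled by float.__add__
assert dynamic_add(1, 2.5) == 3.5     # int.__add__ defers to float.__radd__
```

Every one of these steps happens on each trip through the Fibonacci loop, which is why the statically typed versions are so much faster.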
Stack Versus Heap Allocation
At the C level, a dynamic Python object is entirely heap allocated. Python takes great pains to manage memory intelligently, using memory pools and interning frequently used integers and strings. But the fact remains that creating and destroying objects (any objects, even scalars) incurs the overhead of working with dynamically allocated memory and Python's memory subsystem. Because Python `float` objects are immutable, operations using Python `float`s involve the creation and destruction of heap-allocated objects. The Cython version of `fib` declares all variables to be stack-allocated C `double`s. As a rule, stack allocation is much faster than heap allocation. Moreover, C floating-point variables are mutable and can be updated in place, so the `for` loop body is much more efficient in terms of allocations and memory usage.
It is not surprising that the C and Cython versions are more than an order of magnitude faster than pure Python, since the Python loop body has to do so much more work per iteration.
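This object churn is visible from pure Python: every floating-point operation yields a brand-new heap object, and `id` (which reports an object's address in CPython) changes accordingly. A small illustration of my own:

```python
# Each addition allocates a fresh float object for its result.
a = 0.0
b = a + 1.0
c = b + 1.0
assert len({id(a), id(b), id(c)}) == 3   # three live objects, three addresses

# Floats are immutable, so "in-place" addition actually rebinds the name
# to a newly allocated object rather than updating the old one.
x = 1.5
id_before = id(x)
x += 1.0
id_after = id(x)
assert id_before != id_after
```

A C `double` on the stack, by contrast, is one fixed 8-byte slot that gets overwritten on every iteration.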
Tempering Our Enthusiasm
It can be exhilarating to see massive performance improvements when we add a few trivial `cdef` statements to Python code. It is worth noting at the start, however, that not all Python code will see massive performance improvements when compiled with Cython. The preceding `fib` example is intentionally CPU bound, meaning all the runtime is spent manipulating a few variables inside CPU registers, and little to no data movement is required. If this function were instead memory bound (e.g., adding the elements of two large arrays), I/O bound (e.g., reading a large file from disk), or network bound (e.g., downloading a file from an FTP server), the performance difference between Python, C, and Cython would likely be significantly decreased (for memory-bound operations) or vanish entirely (for I/O-bound or network-bound operations).
When improving Python’s performance is the goal, the Pareto principle works in our favor: we can expect that approximately 80 percent of a program’s runtime is due to only 20 percent of the code. A corollary to this principle is that the 20 percent is very difficult to locate without profiling. But there is no excuse not to profile Python code, given how simple its built-in profiling tools are to use. Before we use Cython to improve performance, getting profiling data is the first step.
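Getting that data is a few lines of code with the standard-library cProfile module. A minimal sketch (the `workload` function is invented for illustration):

```python
import cProfile
import io
import pstats

def fib(n):
    a, b = 0.0, 1.0
    for i in range(n):
        a, b = a + b, a
    return a

def workload():
    # Stand-in for a real program whose hot spot is the repeated fib calls.
    return [fib(90) for _ in range(1000)]

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Render the top entries by cumulative time into a string report.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The top entries point at `workload` and `fib`: the 20 percent of the code worth considering for Cython.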
That said, if we determine via profiling that the bottleneck in our program is due to it being I/O or network bound, then we cannot expect Cython to provide a significant improvement in performance. It is worth determining the kind of performance bottleneck you have before turning to Cython—it is a powerful tool, but it must be used in the right way.
Because Cython brings C's type system to Python, all the limitations of C data types become relevant concerns. Python integer objects silently convert to unlimited-precision Python long objects when computing large values. C `int`s or `long`s are fixed precision, meaning that they cannot properly represent unlimited-precision integers. Cython has features to help catch these overflows, but the larger point remains: C data types are faster than their Python counterparts, but are sometimes not as flexible or general.
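The flexibility gap is easy to see in pure Python (my illustration): an integer Fibonacci is exact for any n thanks to arbitrary-precision `int`s, while the same value reinterpreted as a signed 64-bit C long would wrap around:

```python
def fib_int(n):
    # Python ints are arbitrary precision: exact results at any size.
    a, b = 0, 1
    for _ in range(n):
        a, b = a + b, a
    return a

exact = fib_int(93)
assert exact == 12200160415121876738
assert exact > 2**63 - 1   # too large for a signed 64-bit C integer

# Two's-complement reinterpretation: what a wrapping 64-bit long would hold.
masked = exact % 2**64
as_signed_64 = masked - 2**64 if masked >= 2**63 else masked
assert as_signed_64 < 0    # the "fast" C type silently goes negative
```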
Let’s consider Cython’s other main feature: interfacing with external code. Suppose that, instead of Python code, our starting point is C or C++ code, and that we want to create Python wrappers for it. Because Cython understands C and C++ declarations and can interface with external libraries, and because it generates highly optimized code, it is easy to write efficient wrappers with it.
Wrapping C Code with Cython
Continuing with our Fibonacci theme, let's start with a C implementation and wrap it in Python using Cython. The interface for our function is in cfib.h:

```c
double cfib(int n);
```
The Cython wrapper code for cfib.h is fewer than 10 lines:

```cython
cdef extern from "cfib.h":
    double cfib(int n)

def fib(n):
    """Returns the nth Fibonacci number."""
    return cfib(n)
```
The `cdef extern` block may not be immediately transparent, but certain elements are easily identified: we provide the cfib.h header filename in the `cdef extern from` statement, and we declare the `cfib` function's signature in the block's indented body. After the `cdef extern` block, we define a `fib` Python wrapper function, which calls `cfib` and returns its result.
After compiling the preceding Cython code into an extension module named `wrap_fib` (we will cover the details of how to compile Cython code in Chapter 2), we can use it from Python:

```pycon
>>> from wrap_fib import fib
>>> help(fib)
Help on built-in function fib in module wrap_fib:

fib(...)
    Returns the nth Fibonacci number.

>>> fib(90)
2.880067194370816e+18
>>>
```
We see that the `fib` function is a regular Python function in the `wrap_fib` extension module, and calling it with a Python integer does what we expect: it calls the underlying C function for us and returns a (large) result. Overall, it took just a handful of lines of Cython code to wrap a simple function. A hand-written wrapper would require several dozen lines of C code and detailed knowledge of the Python/C API. The performance benefits we saw in the previous section apply here as well: Cython's wrapper code is better optimized than a hand-written version of the same.
This example was intentionally simple. Provided the values are in range, a Python `int` converts to a C `int` without issue; an out-of-range value raises an `OverflowError`. Internally, the Python `float` type stores its value in a C `double`, so there are no conversion issues for the `cfib` return type.
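This bounds-checked conversion has a pure-Python analogue in the standard-library struct module, which packs Python values into C-sized fields and rejects out-of-range inputs much as Cython raises `OverflowError` (an analogy of mine, not Cython's own machinery):

```python
import struct

# A Python int that fits a 32-bit C int round-trips cleanly.
packed = struct.pack("i", 90)
assert struct.unpack("i", packed)[0] == 90

# A Python float round-trips through a C double with no loss,
# because CPython stores floats as C doubles internally.
value = 2.880067194370816e+18
assert struct.unpack("d", struct.pack("d", value))[0] == value

# An int too large for a C int is rejected rather than silently truncated.
try:
    struct.pack("i", 2**40)
    raised = False
except struct.error:
    raised = True
assert raised
```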
Because we are using simple scalar data, Cython can generate the type
conversion code automatically. In future chapters, we will see how Cython
helps us wrap arbitrarily complex data structures, classes, functions, and
methods. Because Cython is a full-fledged language (and not just a
domain-specific language for interfacing like other wrapping tools provide), we
can use it to do whatever we like before and after the wrapped function call.
Because the Cython language understands Python and has access to Python’s
standard library, we can leverage all of Python’s power and flexibility.
We can use Cython’s two raisons d’être in one file, speeding up Python code alongside calling external C functions. We can even do both inside the same function! We will see this in future chapters.
Summary
This chapter is meant to whet the appetite. We have seen Cython’s essential features, distilled to their most basic elements. The rest of this book explains the Cython language in depth, covers how to compile and run Cython code, describes how to interface with C and C++, and provides many examples to help you use Cython effectively in your own projects.
[1] To follow along with the examples in this chapter, please see https://github.com/cythonbook/examples.
[2] Timings were measured on a four-core 2.4 GHz Intel Core i5 with 8 GB of 1,067 MHz DDR3 memory, running Mac OS X version 10.7.5.