Chapter 1. Cython Essentials

The test of a first-rate intelligence is the ability to hold two opposed ideas in mind at the same time and still retain the ability to function.

F. Scott Fitzgerald

Cython is two closely related things:

  • Cython is a programming language that blends Python with the static type system of C and C++.
  • cython is a compiler that translates Cython source code into efficient C or C++ source code. This source can then be compiled into a Python extension module or a standalone executable.

Cython’s power comes from the way it combines Python and C: it feels like Python while providing easy access to C. Cython is situated between high-level Python and low-level C; one might call it a creole programming language.

But Python and C-like languages are so different—why combine them? Precisely because their differences are complementary. Python is high-level, dynamic, easy to learn, and flexible. These positives come with a cost, however—because Python is dynamic and interpreted, it can be several orders of magnitude slower than statically typed compiled languages.

C, on the other hand, is one of the oldest statically typed compiled languages in widespread use, so compilers have had nearly half a century to optimize its performance. C is very low level and very powerful. Unlike Python, it does not have many safeguards in place and can be difficult to use.

Both languages are mainstream, but they are typically used in different domains, given their differences. Cython’s beauty is this: it combines Python’s expressiveness and dynamism with C’s bare-metal performance while still feeling like Python.

With very few exceptions, Python code (both versions 2.x and 3.x) is already valid Cython code. Cython adds a small number of keywords to the Python language to tap into C’s type system, allowing the cython compiler to generate efficient C code. If you already know Python and have a basic understanding of C or C++, you will be able to quickly learn Cython. You do not have to learn yet another interface language.

We can think of Cython as two projects in one. If compiling Python to C is Cython’s yin, then interfacing C or C++ with Python is its yang. We can start with Python code that needs better performance, or we can start with C (or C++) code that needs an optimized Python interface. To speed up Python code, Cython compiles Python source with optional static type declarations to achieve massive performance improvements, depending on the algorithm. To interface C or C++ libraries with Python, we can use Cython to interface with external code and create optimized wrappers. Both capabilities—compiling Python and interfacing with external code—are designed to work together well, and each is an essential part of what makes Cython useful. With Cython, we can move in either direction, coming from either starting point.

Let’s see an example.

Comparing Python, C, and Cython

Consider a simple Python function fib that computes the nth Fibonacci number:[1]

def fib(n):
    a, b = 0.0, 1.0
    for i in range(n):
        a, b = a + b, a
    return a
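
A quick check in an interactive session (the results are floats because a and b are initialized to 0.0 and 1.0):

>>> fib(10)
55.0
>>> fib(90)
2.880067194370816e+18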

As mentioned in the introduction, this Python function is already a valid Cython function, and it has identical behavior in both Python and Cython. We will see shortly how we can add Cython-specific syntax to fib to improve its performance.

The C transliteration of fib follows the Python version closely:

double cfib(int n) {
    int i;
    double a=0.0, b=1.0, tmp;
    for (i=0; i<n; ++i) {
        tmp = a; a = a + b; b = tmp;
    }
    return a;
}

We use doubles in the C version and floats in the Python version to make the comparison direct and remove any issues related to integer overflow for C integral data types.

Imagine blending the types from the C version with the code from the Python version. The result is a statically typed Cython version:

def fib(int n):
    cdef int i
    cdef double a=0.0, b=1.0
    for i in range(n):
        a, b = a + b, a
    return a

As mentioned previously, Cython understands Python code, so our unmodified Python fib function is also valid Cython code. To convert the dynamically typed Python version to the statically typed Cython version, we use the cdef Cython statement to declare the statically typed C variables i, a, and b. Even for readers who haven’t seen Cython code before, it should be straightforward to understand what is going on.

What about performance? Table 1-1 has the results.

Table 1-1. Fibonacci timings for different implementations
Version        fib(0) [ns]   Speedup   fib(90) [ns]   Speedup   Loop body [ns]   Speedup
Pure Python            590         1        12,852          1          12,262          1
Pure C                   2       295           164         78             162         76
C extension            220         3           386         33             166         74
Cython                  90         7           258         50             168         73

In Table 1-1,[2] the second column measures the runtime for fib(0) and the third column measures the speedup of fib(0) relative to Python. Because the argument to fib controls the number of loop iterations, fib(0) does not enter the Fibonacci loop, so its runtime is a reasonable measure of the language-runtime and function-call overhead.

The fourth and fifth columns measure the runtime and speedup for fib(90), which executes the loop 90 times. Both the call overhead and the loop execution runtime contribute to its runtime.

The sixth and seventh columns measure the difference between the fib(90) and fib(0) runtimes and the corresponding speedup. This difference approximates the runtime of the loop body alone, with language-runtime and call overhead removed; for pure Python, for example, 12,852 - 590 = 12,262 ns.

Table 1-1 has four rows:

Pure Python
The first row (after the header) measures the performance of the pure-Python version of fib, and as expected, it has the poorest performance by a significant margin in all categories. In particular, the call overhead for fib(0) is over half a microsecond on this system. Each loop iteration in fib(90) requires roughly 136 nanoseconds (12,262 ns spread over 90 iterations); Python leaves much room for improvement.
Pure C
The second row measures the performance of the pure-C version of fib. In this version there is no interaction with the Python runtime, so there is minimal call overhead; this also means it cannot be used from Python. This version provides a bound for the best performance we can reasonably expect from a simple serial fib function. The fib(0) value indicates that C function call overhead is minimal (2 nanoseconds) when compared to Python, and the fib(90) runtime (164 nanoseconds) is nearly 80 times faster than Python’s on this particular system.
Hand-written C extension
The third row measures a hand-written C extension module for Python 2. This extension module requires several dozen lines of C code, most of it boilerplate that calls the Python/C API. When calling from Python, the extension module must convert Python objects to C data, compute the Fibonacci number in C, and convert the result back to a Python object. Its call overhead (the fib(0) column) is correspondingly larger than that of the pure-C version, which does not have to convert from and to Python objects. Because it is written in C, it is about three times faster than pure Python for fib(0). It also gives a nice factor-of-30 speedup for fib(90).
Cython
The last row measures the performance for the Cython version. Like the C extension, it is usable from Python, so it must convert Python objects to C data before it can compute the Fibonacci number, and then convert the result back to Python. Because of this overhead, it cannot match the pure-C version for fib(0), but, notably, it has about 2.5 times less overhead than the hand-written C extension. Because of this reduced call overhead, it is able to provide a speedup of about a factor of 50 over pure Python for fib(90).

The takeaway from Table 1-1 is in the last two columns: the loop runtimes for the pure-C, C-extension, and Cython versions are all about 165 nanoseconds on this system, and their speedups relative to Python are all approximately 75×.

Note

For the C-only parts of an algorithm—provided sufficient static type information is available—Cython can usually generate code that is as efficient as a pure-C equivalent.

So, when properly accounting for Python overhead, we see that Cython achieves C-level performance. Moreover, it does better than the hand-written C extension module on the Python-to-C conversions.

Note

Cython generates highly optimized code that is frequently faster than an equivalent hand-written C extension module. It is often able to generate Python-to-C conversion code that is several factors faster than naive calls into the Python/C API.

As we will learn in Chapter 3, we can go even further and use Cython to create Python-like C functions that have no Python overhead. These functions can be called from other Cython code but cannot be called directly from Python. They allow us to remove expensive call overhead for core computations.
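
As a preview (and only a sketch; the details and idioms come in Chapter 3), such a function is declared with cdef instead of def, and a small def wrapper keeps the computation reachable from Python. The name fib_c below is ours, not taken from the book's examples:

cdef double fib_c(int n):
    # A C-level function: no Python call overhead, but callable only from
    # Cython code in this module, not directly from Python.
    cdef int i
    cdef double a = 0.0, b = 1.0
    for i in range(n):
        a, b = a + b, a
    return a

def fib(int n):
    # A thin def wrapper keeps the computation callable from Python.
    return fib_c(n)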

What is the reason for Cython’s performance improvements? For this example, the likely causes are function call overhead, looping, math operations, and stack versus heap allocations.

Function Call Overhead

The fib(0) runtime is mostly consumed by the time it takes to call a function in the respective language; the time to run the function’s body is relatively small. We see in Table 1-1 that Cython generates code that is nearly an order of magnitude faster than calling a Python function, and more than two times faster than the hand-written extension. Cython accomplishes this by generating highly optimized C code that bypasses some of the slower Python/C API calls. The hand-written C extension measured in the preceding timings goes through these slower API calls.
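
The pure-Python fib(0) figure can be approximated with the standard library's timeit module. This is only a rough sketch, not the harness used for Table 1-1, and the numbers will vary from system to system:

import timeit

# Pure-Python fib, identical to the version shown earlier in the chapter.
setup_code = """
def fib(n):
    a, b = 0.0, 1.0
    for i in range(n):
        a, b = a + b, a
    return a
"""

# fib(0) never enters the loop, so this mostly measures call overhead.
n_calls = 1000000
total = timeit.timeit("fib(0)", setup=setup_code, number=n_calls)
print("fib(0): %.0f ns per call" % (total / n_calls * 1e9))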

Looping

Python for loops are notoriously slow compared to loops in compiled languages. One surefire way to speed up loopy Python code is to move the Python for and while loops into compiled code, either by calling built-in functions or by using something like Cython to do the transformation for you. The fib(90) column in the table reflects 90 iterations of the loop in each language, and we can see the impact of this on the runtimes of the different versions.

Math Operations

Because Python is dynamically typed and cannot make any type-based optimizations, an expression like a + b could do anything. We may know that a and b are only ever going to be floating-point numbers, but Python never makes that assumption. So, at runtime, Python has to look up the types of both a and b (which, in this instance, are the same). It must then find the type’s underlying __add__ method (or the equivalent), and call __add__ with a and b as arguments. Inside this method, the Python floats a and b have to be unboxed to extract the underlying C doubles, and only then can the actual addition occur! The result of this addition has to be packaged in an entirely new Python float and returned as the result.
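
Conceptually, each a + b in the pure-Python loop expands to something like the following sketch (a simplification of CPython's actual binary-operator protocol, which also handles __radd__ and subclass fallbacks):

a, b = 1.0, 2.0

add = type(a).__add__    # look up float.__add__ on the type of a
result = add(a, b)       # unbox a and b, add the C doubles, box the result

assert result == a + b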

The C and Cython versions already know that a and b are doubles and can never be anything else, so adding a and b compiles to just one machine code instruction.

Stack Versus Heap Allocation

At the C level, a dynamic Python object is entirely heap allocated. Python takes great pains to intelligently manage memory, using memory pools and interning frequently used integers and strings. But the fact remains that creating and destroying objects—any objects, even scalars—incurs overhead to work with dynamically allocated memory and Python’s memory subsystem. Because Python float objects are immutable, operations using Python floats involve the creation and destruction of heap-allocated objects. The Cython version of fib declares all variables to be stack-allocated C doubles. As a rule, stack allocation is much faster than heap allocation. Moreover, a C double variable can be updated in place, meaning that the for loop body is much more efficient in terms of allocations and memory usage.

It is not surprising that the C and Cython versions are more than an order of magnitude faster than pure Python, since the Python loop body has to do so much more work per iteration.

Tempering Our Enthusiasm

It can be exhilarating to see massive performance improvements when we add some trivial cdef statements to Python code. It is worth noting at the start, however, that not all Python code will see massive performance improvements when compiled with Cython. The preceding fib example is intentionally CPU bound, meaning all the runtime is spent manipulating a few variables inside CPU registers, and little to no data movement is required. If this function were, instead, memory bound (e.g., adding the elements of two large arrays), I/O bound (e.g., reading a large file from disk), or network bound (e.g., downloading a file from an FTP server), the performance difference between Python, C, and Cython would likely be significantly decreased (for memory-bound operations) or vanish entirely (for I/O-bound or network-bound operations).

When improving Python’s performance is the goal, the Pareto principle works in our favor: we can expect that approximately 80 percent of a program’s runtime is due to only 20 percent of the code. A corollary to this principle is that the 20 percent is very difficult to locate without profiling. But there is no excuse not to profile Python code, given how simple its built-in profiling tools are to use. Before we use Cython to improve performance, getting profiling data is the first step.
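
For instance, the standard library's cProfile module can surface the hot 20 percent with a few lines. The sketch below uses placeholder function names (hot_loop and main are ours, not code from the book's examples); it can also be run from the command line as python -m cProfile -s cumulative myscript.py:

import cProfile

def hot_loop():
    # Stand-in for the 20 percent of the code that dominates the runtime.
    return sum(i * i for i in range(10 ** 6))

def main():
    return hot_loop()

# Sort by cumulative time to surface the functions worth optimizing.
cProfile.run("main()", sort="cumulative")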

That said, if profiling shows that our program is I/O or network bound, then we cannot expect Cython to provide a significant improvement in performance. It is worth determining the kind of performance bottleneck you have before turning to Cython—it is a powerful tool, but it must be used in the right way.

Because Cython brings C’s type system to Python, all limitations of C data types become relevant concerns. Python integer objects silently convert to unlimited-precision Python long objects when computing large values. C ints or longs are fixed precision, meaning that they cannot properly represent unlimited-precision integers. Cython has features to help catch these overflows, but the larger point remains: C data types are faster than their Python counterparts, but are sometimes not as flexible or general.
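
To make the concern concrete, the sketch below (not an example from the book) switches fib over to a C long long. The 93rd Fibonacci number exceeds the range of a signed 64-bit integer, and Cython's overflowcheck directive turns the silent wraparound into an exception:

# cython: overflowcheck=True

def fib_int(int n):
    cdef int i
    # fib(93) = 12,200,160,415,121,876,738, which exceeds the range of a
    # signed 64-bit integer; with overflowcheck enabled, the overflow
    # raises an OverflowError instead of silently wrapping around.
    cdef long long a = 0, b = 1
    for i in range(n):
        a, b = a + b, a
    return a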

Let’s consider Cython’s other main feature: interfacing with external code. Suppose that, instead of Python code, our starting point is C or C++ code, and that we want to create Python wrappers for it. Because Cython understands C and C++ declarations and can interface with external libraries, and because it generates highly optimized code, it is easy to write efficient wrappers with it.

Wrapping C Code with Cython

Continuing with our Fibonacci theme, let’s start with a C implementation and wrap it in Python using Cython. The interface for our function is in cfib.h:

double cfib(int n);

The Cython wrapper code for cfib.h is fewer than 10 lines:

cdef extern from "cfib.h":
    double cfib(int n)

def fib(n):
    """Returns the nth Fibonacci number."""
    return cfib(n)

The cdef extern block may not be immediately transparent, but certain elements are easily identified: we provide the cfib.h header filename in the cdef extern from statement, and we declare the cfib function’s signature in the block’s indented body. After the cdef extern block, we define a fib Python wrapper function, which calls cfib and returns its result.
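
Before it can be imported, the wrapper has to be compiled into an extension module. Chapter 2 covers compilation properly; the following is only a sketch of a setuptools-based setup.py, assuming the Cython code above is saved as wrap_fib.pyx in the same directory as cfib.c and cfib.h:

from setuptools import Extension, setup
from Cython.Build import cythonize

# Build a single extension module, wrap_fib, from the Cython wrapper
# plus the C implementation it calls into.
ext = Extension("wrap_fib", sources=["wrap_fib.pyx", "cfib.c"])
setup(ext_modules=cythonize([ext]))

Running python setup.py build_ext --inplace would then build wrap_fib in the current directory.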

After compiling the preceding Cython code into an extension module named wrap_fib (we will cover the details of how to compile Cython code in Chapter 2), we can use it from Python:

>>> from wrap_fib import fib
>>> help(fib)
Help on built-in function fib in module wrap_fib:

fib(...)
    Returns the nth Fibonacci number.

>>> fib(90)
2.880067194370816e+18
>>>

We see that the fib function is a regular Python function inside the wrap_fib extension module, and calling it with a Python integer does what we expect, calling into the underlying C function for us and returning a (large) result. Overall, it was just a handful of lines of Cython code to wrap a simple function. A hand-written wrapper would require several dozen lines of C code, and detailed knowledge of the Python/C API. The performance benefits we saw in the previous section apply here as well—Cython’s wrapper code is better optimized than a hand-written version of the same.

This example was intentionally simple. Provided the value is in range, a Python int converts to a C int without issue; an out-of-range value raises an OverflowError, as shown in the sketch below. Internally, the Python float type stores its value in a C double, so there are no conversion issues for cfib's return type. Because we are using simple scalar data, Cython can generate the type-conversion code automatically. In future chapters, we will see how Cython helps us wrap arbitrarily complex data structures, classes, functions, and methods. Because Cython is a full-fledged language (and not just a domain-specific interfacing language like those provided by other wrapping tools), we can use it to do whatever we like before and after the wrapped function call. And because the Cython language understands Python and has access to Python's standard library, we can leverage all of Python's power and flexibility.
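
For instance, on a platform where a C int is 32 bits wide, passing a value that does not fit is rejected during the automatic conversion. A sketch of such a session follows; the exact error message can vary between Cython versions:

>>> from wrap_fib import fib
>>> fib(10**10)
Traceback (most recent call last):
  ...
OverflowError: value too large to convert to int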

It should be noted that we can use Cython’s two raisons d'être in one file—speeding up Python alongside calling external C functions. We can even do both inside the same function! We will see this in future chapters.

Summary

This chapter is meant to whet the appetite. We have seen Cython’s essential features, distilled to their most basic elements. The rest of this book explains the Cython language in depth, covers how to compile and run Cython code, describes how to interface with C and C++, and provides many examples to help you use Cython effectively in your own projects.



[1] To follow along with the examples in this chapter, please see https://github.com/cythonbook/examples.

[2] Timings were measured on a four-core 2.4 GHz Intel Core i5 with 8 GB of 1,067 MHz DDR3 memory, running Mac OS X version 10.7.5.
