There is a tension at the heart of Python. The language is celebrated for its readability, its expressiveness, its sheer productivity. But when your algorithm hits a tight numerical loop, when your data pipeline needs to crunch through millions of records per second, when your machine learning model needs to train before the heat death of the universe — Python's interpreted nature becomes a wall.
Cython exists to demolish that wall. It is a programming language that is a superset of Python, meaning that nearly all valid Python code is already valid Cython code. But Cython adds something Python alone cannot offer: the ability to declare C types on variables, call C and C++ functions natively, and compile your code down to highly optimized C extension modules. The result is code that reads like Python but runs like C.
This is not a toy project or an academic curiosity. Cython is the compiled backbone of some of the Python ecosystem's foundational libraries — scikit-learn, SciPy, pandas, and lxml, among others. If you have ever called a machine learning estimator or computed a statistical function in Python, there is a strong chance Cython was doing the real work under the hood.
This article walks through what Cython actually is, how it works at a technical level, where it came from, how to use it in practice with real code, and where it is headed. No surface-level overviews. Real understanding.
From Pyrex to Cython: A History of Practical Necessity
Cython's story begins not with Cython itself, but with its predecessor: Pyrex. In 2002, Greg Ewing at the University of Canterbury in New Zealand created Pyrex as a way to write C extensions for Python without dealing with the notoriously tedious CPython C API. Pyrex allowed you to write Python-like code with optional C type declarations, and it would generate the C code for you.
Pyrex worked, but Ewing maintained a deliberately narrow scope for the project. As Python's scientific computing community grew, developers found themselves needing features that Pyrex did not provide and that Ewing declined to add. Various groups began maintaining their own forks.
"Over the Pyrex mailing list, I got in touch with other developers who had their own more or less enhanced versions of Pyrex, including Robert Bradshaw, one of the developers in the Sage project. Eventually, in 2007, we decided to follow the example of the Apache web server in bringing together the scattered bunch of existing Pyrex patches into a new project." — Stefan Behnel, 2022 retrospective
William Stein from the Sage computer algebra project provided the name and initial infrastructure. The Cython project was born — a unification of scattered forks into a single, community-driven compiler. Pyrex 0.1 had been published on April 4th, 2002; by 2007, Cython had forked and begun charting its own course. As Behnel noted in that same retrospective, Cython "now serves easily hundreds of thousands of developers worldwide, day to day."
How Cython Works: The Compilation Pipeline
Understanding Cython requires understanding its compilation model. Here is what happens when you compile a .pyx file:
your_module.pyx  (Cython source)
        |
        v
  [Cython Compiler]
        |
        v
your_module.c  (Generated C code)
        |
        v
  [C Compiler (gcc, clang, MSVC)]
        |
        v
your_module.so / your_module.pyd  (Compiled extension module)
        |
        v
import your_module  (Used like any Python module)
The generated .so (Linux/macOS) or .pyd (Windows) file is a standard CPython extension module. You import it with a regular import statement, and Python code that uses it has no idea it was written in Cython rather than pure C.
Cython does not replace CPython. It generates code that runs on CPython. The generated C code makes extensive calls into the CPython interpreter and its C API. This is what makes Cython modules fully compatible with the Python ecosystem — they participate in Python's garbage collection, exception handling, and object model.
Let us see this in action. Create a file called example.pyx:
# example.pyx
def sum_squares_python(n):
    """Pure Python -- no type declarations."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_cython(int n):
    """Typed Cython -- compiles to tight C loops."""
    cdef int i
    cdef long long total = 0
    for i in range(n):
        total += i * i
    return total
The first function uses no type declarations. Cython will still compile it to C, but every operation must go through Python's object protocol — boxing integers, looking up the __mul__ method, and so on. The Cython documentation notes that compiling pure Python this way "usually results only in a speed gain of about 20%-50%."
The second function declares n as a C int, the loop variable i as a C int, and the accumulator total as a C long long. Now the Cython compiler can generate a tight C loop with raw integer arithmetic — no Python objects involved in the hot path. The difference in performance is not incremental; it is often an order of magnitude or more.
To compile and test this, use a setup.py:
# setup.py
from setuptools import setup
from Cython.Build import cythonize
setup(
    ext_modules=cythonize("example.pyx"),
)
Then build and benchmark:
# Modern approach — recommended over python setup.py build_ext --inplace
pip install --no-build-isolation -e .

# benchmark.py
import time
from example import sum_squares_python, sum_squares_cython
n = 10_000_000
start = time.perf_counter()
sum_squares_python(n)
python_time = time.perf_counter() - start
start = time.perf_counter()
sum_squares_cython(n)
cython_time = time.perf_counter() - start
print(f"Pure Python style: {python_time:.4f}s")
print(f"Typed Cython: {cython_time:.4f}s")
print(f"Speedup: {python_time / cython_time:.1f}x")
# Typical result: 30-100x speedup depending on hardware
This is not a synthetic benchmark trick. This is the fundamental mechanism that powers performance-critical code across the scientific Python stack.
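Before trusting the speedup numbers, it is worth confirming both versions compute the same thing. The sum of squares has a closed form, which makes a convenient sanity check; a quick sketch in plain Python (the function name here is illustrative, not part of the example module):

```python
def sum_squares_reference(n):
    """Closed form for sum(i*i for i in range(n)): (n-1)*n*(2n-1) / 6."""
    return (n - 1) * n * (2 * n - 1) // 6

# Spot-check the closed form against a brute-force loop.
assert sum_squares_reference(10) == sum(i * i for i in range(10))  # both 285
assert sum_squares_reference(1) == 0
```

Comparing both compiled functions against a reference like this catches the most common Cython porting mistake: a C integer type too small for the result, which overflows silently.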
The Annotation Tool: Seeing Where Python Ends and C Begins
One of Cython's most practical features is its HTML annotation tool, which visually shows you exactly which lines of your code interact with the CPython interpreter and which compile to pure C. Lines highlighted in yellow indicate Python object interactions; white lines indicate pure C.
cython --annotate example.pyx
This generates example.html. Open it in a browser and you will see each line of your code color-coded. The typed function will be almost entirely white (pure C), while the untyped function will be deeply yellow (heavy Python interaction). The scikit-learn project's Cython best practices documentation emphasizes this tool, advising developers that "interactions with the CPython interpreter must be avoided as much as possible in the computationally intensive sections of the algorithms."
The annotation tool is not just a diagnostic — it is a workflow. Write your Cython code, run the annotator, identify the yellow lines, add type declarations to turn them white, and repeat until your hot path is pure C.
Pure Python Mode: Cython Without Leaving Python
Historically, Cython had its own syntax that blended Python with C-style type declarations — the cdef, cpdef, and cimport keywords. This syntax worked well but created a problem: standard Python linting tools, type checkers, and IDEs could not parse .pyx files.
Cython 3.0, released on July 17, 2023, dramatically expanded pure Python mode — an alternative syntax that is valid Python and can be understood by standard Python tools. An InfoWorld article covering the release highlighted this as the key change: pure Python mode lets developers run existing Python tooling on Cython code, closing a long-standing usability gap.
Here is what the same optimized function looks like in pure Python mode:
# example_pure.py (note: .py extension, not .pyx)
import cython
@cython.ccall
def sum_squares(n: cython.int) -> cython.longlong:
    total: cython.longlong = 0
    i: cython.int
    for i in range(n):
        total += i * i
    return total
This file is valid Python. You can run it uncompiled with the standard Python interpreter (it will just ignore the type hints). But when compiled with Cython, those type annotations trigger the same C-level optimizations as the .pyx syntax. Your IDE's type checker works. Your linter works. Your code runs as both Python and Cython.
The @cython.ccall decorator is the pure Python equivalent of cpdef — it creates a function that is callable from both Python and C. Other decorators include @cython.cfunc (equivalent to cdef, C-only) and @cython.cclass (equivalent to cdef class).
For working with NumPy arrays, Cython's typed memoryviews provide direct buffer access without the overhead of the NumPy Python API:
import cython
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.ccall
def normalize_array(data: cython.double[:]) -> None:
    """Normalize array values to the 0-1 range, in place."""
    n: cython.Py_ssize_t = data.shape[0]
    min_val: cython.double = data[0]
    max_val: cython.double = data[0]
    i: cython.Py_ssize_t
    range_val: cython.double
    for i in range(n):
        if data[i] < min_val:
            min_val = data[i]
        if data[i] > max_val:
            max_val = data[i]
    range_val = max_val - min_val
    if range_val > 0:
        for i in range(n):
            data[i] = (data[i] - min_val) / range_val
The @cython.boundscheck(False) directive disables array bounds checking, and @cython.wraparound(False) disables support for negative indexing. Both produce faster code but remove safety checks — use them only when you are confident your indices are valid.
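To see exactly what these directives give up, recall the Python-level semantics they disable. The snippet below is plain Python; in compiled Cython code with both checks off, the same expressions would read out of bounds instead of wrapping or raising:

```python
data = [10.0, 20.0, 30.0]

# Python semantics: a negative index counts from the end of the sequence.
assert data[-1] == 30.0

# An out-of-range index raises IndexError under bounds checking.
try:
    data[3]
except IndexError:
    print("caught IndexError")

# With @cython.wraparound(False), data[-1] in compiled code would read
# before the start of the buffer; with @cython.boundscheck(False),
# data[3] would read past its end. Both are undefined behavior in C.
```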
Releasing the GIL: True Parallelism in Cython
One of Cython's most powerful capabilities is the ability to release Python's Global Interpreter Lock (GIL) and execute code with true parallelism. Code that does not interact with Python objects can run in parallel across multiple CPU cores.
"Cython code can free the GIL around code sections that need parallelism and do not interact with the Python runtime and Python data structures." — Stefan Behnel, FOSDEM 2018
Cython integrates with OpenMP through the cython.parallel module, providing a prange construct for parallel loops:
# parallel_example.pyx
from cython.parallel import prange

def parallel_sum_squares(double[:] data):
    """Sum squares of array elements using OpenMP parallelism."""
    cdef Py_ssize_t i
    cdef Py_ssize_t n = data.shape[0]
    cdef double total = 0.0
    # nogil releases the GIL; prange distributes iterations across threads
    for i in prange(n, nogil=True):
        total += data[i] * data[i]
    return total
The nogil=True parameter tells Cython to release the GIL for the duration of the loop. The prange function distributes loop iterations across threads using OpenMP. Every line inside this loop must be free of Python object interactions — only C-level operations are permitted.
This is exactly how scikit-learn achieves parallelism in its performance-critical algorithms. The scikit-learn documentation explicitly notes that nogil declarations are "just hints" about what is permitted, and that you must actively release the GIL using either prange(nogil=True) or a with nogil: context manager.
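One practical detail: prange only runs in parallel if the module is compiled with OpenMP enabled, which requires passing compiler and linker flags through the build. A build-script sketch (the -fopenmp flags below are for gcc and clang; MSVC uses /openmp instead):

```python
# setup.py -- sketch: enable OpenMP when compiling parallel_example.pyx
from setuptools import Extension, setup
from Cython.Build import cythonize

extensions = [
    Extension(
        "parallel_example",
        sources=["parallel_example.pyx"],
        extra_compile_args=["-fopenmp"],  # gcc/clang; use /openmp for MSVC
        extra_link_args=["-fopenmp"],
    )
]

setup(ext_modules=cythonize(extensions))
```

Without these flags the code still compiles and runs correctly, but the prange loop executes serially.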
Wrapping C and C++ Libraries
Beyond accelerating Python code, Cython's other major use case is wrapping existing C and C++ libraries for use in Python. This is what Behnel was referring to in the 2011 IEEE paper "Cython: The Best of Both Worlds" (co-authored with Bradshaw, Citro, Dalcin, Seljebotn, and Smith), which described Cython as addressing "Python's large overhead for numerical loops and the difficulty of efficiently making use of existing C and Fortran code."
Here is a practical example of wrapping a C function. Suppose you have a C library with a fast sorting routine:
// fastsort.h
void quicksort(double* arr, int n);
You can wrap it in Cython like this:
# fastsort.pyx
cdef extern from "fastsort.h":
    void quicksort(double* arr, int n)

def sort_array(double[:] arr not None):
    """Sort a NumPy array in place using the C quicksort implementation."""
    if arr.shape[0] == 0:
        return  # &arr[0] would be invalid for an empty array
    quicksort(&arr[0], arr.shape[0])
That is it. The cdef extern from block tells Cython the function's signature. The memoryview double[:] gives you direct pointer access to NumPy array data via &arr[0]. Python code can now call sort_array(my_numpy_array) and the C library does the work.
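For contrast, the standard library's ctypes module can also call C, at runtime and without any compilation step, but with per-call overhead and manual type plumbing that Cython hides. A sketch calling libc's qsort on a C array of doubles (POSIX only; ctypes.CDLL(None) loads the symbols already linked into the running process):

```python
import ctypes

# Load the C library symbols from the current process (POSIX systems).
libc = ctypes.CDLL(None)

# qsort's comparator type: int (*)(const void *, const void *)
CMPFUNC = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_double),
)

def compare_doubles(a, b):
    # a[0] dereferences the pointer to the element being compared.
    diff = a[0] - b[0]
    return (diff > 0) - (diff < 0)  # -1, 0, or 1

values = (ctypes.c_double * 4)(3.0, 1.0, 2.0, 0.5)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_double),
           CMPFUNC(compare_doubles))
print(list(values))  # [0.5, 1.0, 2.0, 3.0]
```

Every call crosses the ctypes marshalling layer, including one Python callback per comparison. The Cython wrapper above has none of that overhead: the compiled code calls quicksort directly with a raw pointer.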
For C++, Cython provides similar support:
# cpp_example.pyx
# distutils: language=c++
from libcpp.vector cimport vector
from libcpp.algorithm cimport sort

def sorted_vector(list python_list):
    """Convert a Python list to a C++ vector, sort it, and return."""
    cdef vector[double] vec
    for item in python_list:
        vec.push_back(item)
    sort(vec.begin(), vec.end())
    return vec
The Cython compiler generates C++ code (triggered by the language=c++ directive) that uses the STL vector and sort directly. The return statement automatically converts the C++ vector back to a Python list.
Who Uses Cython and Why
Cython's adoption in the scientific Python ecosystem is not incidental — it was a deliberate architectural choice by major projects.
"In scikit-learn, we have decided early on to do Cython, rather than C or C++. That decision has been a clear win because the code is way more maintainable." — Gael Varoquaux, co-creator of scikit-learn
The scikit-learn project uses Cython for kernel fusion, parallel tree building in random forests, distance computations, and LIBSVM/LIBLINEAR wrapper interfaces.
SciPy's documentation describes its own composition as approximately 50% Python, 25% Fortran, 20% C, 3% Cython, and 2% C++. The SciPy contributor guide recommends Cython (alongside Pythran) for new performance-critical additions before reaching for C or C++, with the primary motivation being maintainability: Cython has the highest abstraction level and is understood by a wide range of Python developers.
"The Cython version took about 30 minutes to write, and it runs just as fast as the C code — because, why wouldn't it? It is C code, really, with just some syntactic sugar." — Matthew Honnibal, creator of spaCy
Peter Z. Wang, commenting on SciPy's adoption, described Cython as "rapidly becoming (or has already become) the lingua franca of exposing legacy libraries to Python." Pauli Virtanen and the SciPy core team have put the project's ordering plainly: for new performance-critical additions, Python first, then Cython, then C, C++, and Fortran, with maintainability as the deciding factor. Mahmoud Hashemi from PayPal noted that while an early attempt at Cython in 2011 did not stick, "since 2015, all native extensions have been written and rewritten to use Cython."
Cython vs. the Alternatives
Cython is not the only option for accelerating Python. Understanding when to use Cython versus its alternatives is a practical skill.
Numba uses LLVM to JIT-compile Python functions decorated with @numba.jit. It requires no separate compilation step and works well for numerical array operations. However, it supports a narrower subset of Python than Cython, cannot wrap external C libraries, and adds a runtime dependency on LLVM. If your bottleneck is a pure numerical loop over NumPy arrays and you want zero build complexity, Numba is excellent. If you need to wrap C libraries or need fine-grained control over memory management, Cython is the better choice.
pybind11 is a C++ header-only library for creating Python bindings. It is a strong choice when you have an existing C++ codebase and want to expose it to Python. However, it requires writing actual C++ code, whereas Cython lets you stay in a Python-like syntax.
Rust + PyO3 is gaining traction for writing Python extensions in Rust, offering memory safety guarantees that neither C nor Cython provide. Projects like Pydantic V2 and Ruff have demonstrated that Rust extensions can be extremely fast. However, this approach requires learning Rust, a language with a steep learning curve.
CPython's own improvements (PEP 659's specializing adaptive interpreter and the experimental JIT compiler) are making pure Python faster with every release. Python 3.11 averaged 25% faster than 3.10 on the pyperformance benchmark suite as measured on Linux, with real-world gains ranging from 10% to 60% depending on workload. But for compute-intensive inner loops, the speedups from Cython (often 10-100x) still dwarf what the interpreter can achieve.
Related PEPs and Python Integration
Cython's evolution has been influenced by and has influenced several Python Enhancement Proposals.
PEP 484 and PEP 526 (Type Hints and Variable Annotations) are the foundation of Cython's pure Python mode. Cython interprets standard Python type annotations, but importantly, cython.int is distinct from Python's int — the former maps to a C int, while the latter remains a Python object. This distinction matters because C integer overflow behavior is fundamentally different from Python's arbitrary-precision integers.
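The difference is easy to demonstrate from plain Python using the stdlib ctypes module, which exposes fixed-width C integer types:

```python
import ctypes

# Python ints have arbitrary precision: no overflow, ever.
assert 2**31 + 1 == 2147483649

# A 32-bit C int wraps around in two's complement.
assert ctypes.c_int32(2**31 - 1).value == 2147483647   # INT_MAX is fine
assert ctypes.c_int32(2**31).value == -2147483648      # INT_MAX + 1 wraps

# This is why a variable declared cython.int can silently overflow in
# compiled code where the identical source, run as pure Python, would not.
```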
PEP 703 (Making the GIL Optional) directly affects Cython. Cython 3.1, released in May 2025, added experimental support for CPython's free-threaded build. As InfoWorld reported in November 2024, if a free-threaded Python interpreter tries to load a Cython module that was not built for free-threading, it will fall back to GIL mode for compatibility. Cython 3.1 also introduced the cython.critical_section context manager, wrapping Python's critical section C-API feature, to help developers write thread-safe extension type code. Cython 3.2, released in November 2025, continued improving free-threading support by adding cython.pymutex — a fast mutex wrapping CPython's PyMutex for fine-grained locking that does not rely on the GIL.
PEP 489 (Multi-phase Extension Module Initialization) changed how C extension modules are loaded, and Cython adapted to support both the old single-phase and new multi-phase initialization protocols.
PEP 384 (Stable ABI / Limited API) defines a restricted set of CPython C API functions that are guaranteed to remain stable across Python versions. Cython 3.1 added experimental support for building against the Limited API, which would allow a single compiled Cython module to work across multiple Python versions without recompilation.
Practical Workflow: From Python to Cython
Here is a realistic workflow for taking a performance-critical Python function and accelerating it with Cython.
Step 1: Profile first. Use cProfile or py-spy to identify the actual bottleneck. Do not guess.
import cProfile
cProfile.run('your_slow_function()', sort='cumulative')
Step 2: Extract the hot function into a .py file with type annotations.
# hotpath.py
import cython

@cython.ccall
def compute_distances(points: cython.double[:, :],
                      query: cython.double[:],
                      result: cython.double[:]) -> None:
    n: cython.Py_ssize_t = points.shape[0]
    d: cython.Py_ssize_t = points.shape[1]
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t
    diff: cython.double
    dist: cython.double
    for i in range(n):
        dist = 0.0
        for j in range(d):
            diff = points[i, j] - query[j]
            dist += diff * diff
        result[i] = dist ** 0.5
Step 3: Compile and annotate.
# Modern build (recommended):
pip install --no-build-isolation -e .
# Then annotate:
cython --annotate hotpath.py
# Open hotpath.html to check for yellow lines
Step 4: Benchmark the compiled version against the original.
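For this step (and for the test suite), a plain-Python reference implementation makes it easy to confirm the compiled version still computes the right answer before comparing speed. A minimal sketch, assuming list-of-lists input (the function name here is illustrative):

```python
import math

def compute_distances_reference(points, query):
    """Plain-Python reference: Euclidean distance from query to each point."""
    return [
        math.sqrt(sum((p - q) ** 2 for p, q in zip(row, query)))
        for row in points
    ]

points = [[0.0, 0.0], [3.0, 4.0]]
print(compute_distances_reference(points, [0.0, 0.0]))  # [0.0, 5.0]
```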
Step 5: If needed, add @cython.boundscheck(False) and prange for further gains.
This file remains valid Python throughout. You can run it uncompiled for debugging, test it with pytest, lint it with ruff or flake8, and compile it for production.
What Is Coming: Cython 3.2, 3.3, and the Free-Threaded Future
Cython's development is active and accelerating. Cython 3.2.0 was released on November 5, 2025, adding cython.pymutex for fine-grained thread-safe locking, improved free-threading support, and further size reductions in generated extension modules. Cython 3.2.4, released on January 4, 2026, continued with stability fixes.
Cython 3.3 is currently in alpha. Its most significant contribution for free-threading is that Cython now automatically inserts critical sections into generated functions — including properties on extension types, auto-generated pickle functions, and dataclass methods — removing a manual step that previously fell to the developer. The @collection_type decorator for extension types, which sets the Py_TPFLAGS_SEQUENCE or Py_TPFLAGS_MAPPING flag on a type, was backported as a usable preview in Cython 3.2 and will be fully standardized in 3.3. C++ exception handling also continues to improve.
The free-threaded CPython story is particularly important for Cython's future. For years, one of Cython's selling points was that it could release the GIL for C-level code, giving it access to parallelism that pure Python could not achieve. With PEP 703 making free-threading available to all Python code, Cython needs to adapt — not because it becomes less relevant, but because the concurrency landscape is changing. Cython code that previously relied on the GIL for implicit thread safety must now explicitly handle thread safety. Cython 3.1 introduced cython.critical_section for lightweight locking on Python objects, and Cython 3.2 added cython.pymutex for cases requiring a persistent, GIL-independent mutex.
Final Thoughts
Cython occupies a unique position in the Python ecosystem. It is not a replacement for Python; it is an amplifier. It takes the parts of your codebase where Python's dynamism is a liability — tight numerical loops, C library interfaces, data pipeline bottlenecks — and gives you a way to express them in a language that feels like Python but compiles like C.
The learning curve is real but manageable. If you know Python and have a basic understanding of C types, you can write effective Cython. The annotation tool tells you exactly where to focus your effort. Pure Python mode means you do not even have to leave your familiar syntax.
"The nice thing about Cython is that it doesn't give you 'half the speed of C' or 'maybe nearly the speed of C, 3 years from now' — it gives the real deal, -O3 C, and it works right now." — Fredrik Johansson, mathematician and mpmath contributor
More than two decades after Pyrex first showed that Python and C could meet in the middle, and nearly two decades into Cython's own existence as a distinct project, Cython remains the most mature, most widely deployed bridge between Python's productivity and C's performance. If you write performance-sensitive Python, understanding Cython is not optional. It is how the tools you already depend on actually work.