When you type python into your terminal and run a script, you are almost certainly running CPython. It is the reference implementation of the Python programming language — written in C and Python itself — and it is the default interpreter that ships with virtually every standard Python installation. Yet despite being the runtime that powers the vast majority of Python code worldwide, many developers never look beneath the surface to understand what CPython actually is, how it works, and why it matters.
This article walks through CPython’s architecture, traces its history from a Christmas hobby project to the backbone of modern computing, examines the landmark PEPs that are reshaping its internals, and demonstrates through real code how CPython executes your programs.
What CPython Actually Is (and Is Not)
CPython is both a compiler and an interpreter. When you execute a .py file, CPython first compiles your source code into an intermediate representation called bytecode, then interprets that bytecode instruction by instruction in its virtual machine. Those .pyc files you see in __pycache__ directories? That is the compiled bytecode CPython generated from your source.
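You can watch the compile half of that pipeline directly with the built-in compile(), which returns a code object — the in-memory form of the bytecode that .pyc files cache on disk — and then hand it to the VM with exec(). A minimal sketch:

```python
# Compile a string of source into a code object, then execute it.
source = "result = (3 + 4) ** 2"
code_obj = compile(source, filename="<demo>", mode="exec")

print(type(code_obj))      # the compiled-bytecode container
print(code_obj.co_consts)  # constants the compiler stored (note any folding)

namespace = {}
exec(code_obj, namespace)  # the VM interprets the bytecode
print(namespace["result"])  # 49
```

This is exactly what happens behind the scenes when you run a .py file, minus the caching step that writes the code object into __pycache__.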
This is a crucial distinction. Python the language is a specification. CPython is one implementation of that specification — by far the dominant one. Other implementations exist: PyPy (which uses a tracing JIT compiler), Jython (which runs on the JVM), IronPython (which targets .NET), and GraalPython (which runs on GraalVM). But when the Python documentation says “Python does X,” it almost always means “CPython does X.”
Guido van Rossum, Python’s creator, started working on the language during Christmas break in December 1989 while at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. In an interview on the Dropbox Blog in 2020, van Rossum described the core design philosophy that continues to guide CPython today: you primarily write code to communicate with other developers, and only secondarily to instruct the computer. That human-first philosophy is baked into every layer of CPython’s design.
How CPython Executes Your Code
Let us trace what happens when CPython runs a simple function. Consider this code:
def add_and_square(a, b):
    total = a + b
    return total ** 2
When CPython encounters this, several stages unfold.
Stage 1: Lexing and Parsing. CPython’s tokenizer breaks your source code into a stream of tokens, and the parser (rewritten as a PEG parser in Python 3.9 via PEP 617) builds an Abstract Syntax Tree (AST) from them.
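You can inspect the AST stage yourself with the standard-library ast module, which exposes the same tree the compiler consumes:

```python
import ast

# Parse the example function and look at the tree the compiler will see.
tree = ast.parse(
    "def add_and_square(a, b):\n"
    "    total = a + b\n"
    "    return total ** 2"
)
# The module body holds one FunctionDef node; dump it with indentation.
print(ast.dump(tree.body[0], indent=2))
```

The printed tree shows FunctionDef, its arguments, the Assign for total, and the Return wrapping a BinOp — the structures the next stage compiles into bytecode.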
Stage 2: Compilation to Bytecode. The AST is compiled into bytecode instructions. You can inspect this yourself using the dis module:
import dis
dis.dis(add_and_square)
This produces output like:
  2           LOAD_FAST                0 (a)
              LOAD_FAST                1 (b)
              BINARY_OP                0 (+)
              STORE_FAST               2 (total)

  3           LOAD_FAST                2 (total)
              LOAD_CONST               1 (2)
              BINARY_OP                8 (**)
              RETURN_VALUE
Each of these instructions is a single operation in CPython’s stack-based virtual machine. LOAD_FAST pushes a local variable onto the stack, BINARY_OP pops two values and pushes the result, and RETURN_VALUE sends the top-of-stack back to the caller.
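The listing can also be consumed programmatically: dis.get_instructions() yields Instruction objects you can filter or count, which is handy in scripts and tests. A minimal sketch:

```python
import dis

def add_and_square(a, b):
    total = a + b
    return total ** 2

# Collect the opcode names instead of printing a human-readable listing.
names = [ins.opname for ins in dis.get_instructions(add_and_square)]
print(names)  # exact contents vary by CPython version
```

On recent versions the list also includes RESUME and version-specific superinstructions, which is a useful reminder that bytecode is an implementation detail, not a stable format.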
The bytecode shown above is simplified for clarity. Modern CPython (3.11+) inserts inline CACHE entries after several instructions, and 3.11+ also emits a RESUME instruction at the start of every function. Running dis.dis() on your own machine will show these entries. To see the full cache layout, pass show_caches=True to dis.dis().
Stage 3: Execution in the VM. CPython’s evaluation loop — the famous ceval.c file (now generated from a domain-specific language as of Python 3.12) — reads each bytecode instruction and dispatches it. This loop is where your Python code actually runs.
Understanding this pipeline is not academic trivia. It directly affects debugging, performance profiling, and your ability to reason about what your code is doing at a low level.
The Global Interpreter Lock: Python’s Most Debated Feature
No discussion of CPython is complete without addressing the GIL — the Global Interpreter Lock. The GIL is a mutex that allows only one thread to execute Python bytecode at a time within a single CPython process. It has been part of CPython since the 1990s.
The GIL simplifies CPython’s memory management by making reference counting — CPython’s primary garbage collection mechanism — thread-safe without requiring fine-grained locking on every object. Every Python object carries a reference count, and when that count reaches zero, the object is immediately deallocated. Without the GIL, every increment and decrement of every reference count would need its own lock, introducing significant overhead.
In a 2022 interview on the Lex Fridman Podcast, van Rossum acknowledged the tension directly: concurrency bugs are harder to diagnose, and while free threading would be a valuable capability, it would also cause more software bugs for developers who are not careful about thread safety.
For I/O-bound tasks — web servers waiting on network responses, file operations, database queries — the GIL is rarely a bottleneck because it is released during I/O operations. The real pain comes with CPU-bound parallel workloads that could benefit from multiple cores.
Here is a concrete demonstration of the GIL’s effect:
import threading
import time
def cpu_bound_work(n):
    """Pure Python CPU-bound task."""
    total = 0
    for i in range(n):
        total += i * i
    return total
# Sequential execution
start = time.perf_counter()
cpu_bound_work(10_000_000)
cpu_bound_work(10_000_000)
sequential_time = time.perf_counter() - start
# Threaded execution (GIL prevents true parallelism)
start = time.perf_counter()
t1 = threading.Thread(target=cpu_bound_work, args=(10_000_000,))
t2 = threading.Thread(target=cpu_bound_work, args=(10_000_000,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded_time = time.perf_counter() - start
print(f"Sequential: {sequential_time:.2f}s")
print(f"Threaded: {threaded_time:.2f}s")
# On standard CPython, threaded time is similar to or even
# SLOWER than sequential due to GIL contention overhead
Run this on standard CPython and the threaded version will not be faster. The GIL prevents true parallel execution of Python bytecode. This is precisely the problem that recent developments are working to solve.
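For contrast, here is the I/O-bound side of the story. Because the GIL is released while a thread waits, the same threading pattern does overlap — time.sleep stands in for a network call or disk read in this sketch:

```python
import threading
import time

def io_bound_work():
    # time.sleep releases the GIL, just like a real blocking I/O call.
    time.sleep(0.2)

start = time.perf_counter()
threads = [threading.Thread(target=io_bound_work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Four 0.2s waits overlap: wall time stays near 0.2s, not 0.8s.
print(f"4 threads of 0.2s I/O took {elapsed:.2f}s")
```

This is why threading remains a perfectly good concurrency model for servers and clients that spend most of their time waiting, even on a GIL build.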
PEP 703: Making the GIL Optional
The effort to remove the GIL has a long history. In 2016, Larry Hastings presented his “GIL-ectomy” research at the Python Language Summit. In 2021, Sam Gross at Meta reignited the discussion with a working prototype of a GIL-free CPython fork that demonstrated acceptable single-threaded performance.
That work culminated in PEP 703 — Making the Global Interpreter Lock Optional in CPython, authored by Gross. The Python Steering Council accepted it with a careful, phased approach. As stated in the Steering Council’s acceptance, the rollout should be gradual and break as little as possible, with the possibility of rolling back changes that turn out to be too disruptive.
The implementation required solving serious technical challenges. Without the GIL, CPython needed biased reference counting (where the owning thread uses cheap non-atomic operations), a thread-safe memory allocator (mimalloc, developed by Daan Leijen at Microsoft), and thread-safe standard collections — all without destroying single-threaded performance.
Python 3.13, released in October 2024, shipped with the free-threaded build as an experimental option, requiring a separate executable (typically python3.13t) and the --disable-gil build flag.
Then came PEP 779 — Criteria for Supported Status for Free-Threaded Python, which defined the requirements for moving from experimental to officially supported. The Steering Council accepted PEP 779 in June 2025. In Python 3.14, released in October 2025, free-threaded CPython reached Phase II — officially supported, though still optional and not the default build.
In an interview published by ODBMS Industry Watch in October 2025, van Rossum offered a characteristically candid assessment of the GIL removal effort: he considers the importance of the GIL removal project to have been overstated, arguing that while it serves the needs of large users such as Meta, it complicates things for potential contributors to the CPython codebase.
Here is how you can experiment with the free-threaded build today:
import sys
import sysconfig
# Check if you are running the free-threaded build
gil_disabled = sysconfig.get_config_var("Py_GIL_DISABLED")
print(f"Free-threaded build: {bool(gil_disabled)}")
# On a free-threaded build, check the GIL status
if hasattr(sys, '_is_gil_enabled'):
    print(f"GIL currently enabled: {sys._is_gil_enabled()}")
PEP 659: The Specializing Adaptive Interpreter
While the GIL story captures headlines, a quieter revolution has been delivering real, measurable performance gains to every CPython user since Python 3.11. That revolution is the specializing adaptive interpreter, described in PEP 659, authored by Mark Shannon.
Shannon proposed the PEP in May 2021 on the python-dev mailing list, framing it as a key part of the plan to improve CPython performance for 3.11 and beyond. The core insight is elegant: Python is a dynamic language, but in practice, many operations are type-stable. A + operation that adds two integers a thousand times is overwhelmingly likely to keep adding integers. So why pay the full cost of dynamic dispatch every single time?
The adaptive interpreter watches your code as it runs. After a few executions of a function, it “quickens” the bytecode, replacing generic instructions with adaptive versions. These adaptive instructions then specialize themselves based on the types they observe:
# When CPython sees this repeatedly called with floats:
def distance(x1, y1, x2, y2):
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
# The BINARY_OP instructions specialize to BINARY_OP_SUBTRACT_FLOAT
# and BINARY_OP_ADD_FLOAT, skipping expensive type lookups
You can observe this specialization in action:
import dis
def tight_loop():
    total = 0.0
    for i in range(1000):
        total += float(i)
    return total
# Run it enough times to trigger specialization
for _ in range(50):
tight_loop()
# Now inspect the specialized bytecode
dis.dis(tight_loop, adaptive=True)
# You will see instructions like BINARY_OP_ADD_FLOAT
# instead of the generic BINARY_OP
If the types change — say someone passes a string instead of a float — the specialized instruction fails its type check, decrements a counter, and if it fails enough times, de-specializes back to the adaptive version. Then it can re-specialize for the new pattern. This adaptivity is what makes the approach safe for a dynamic language.
The results speak for themselves. Python 3.11 was measured as approximately 25% faster than Python 3.10 on the pyperformance benchmark suite. As PEP 659 itself notes, extensive experimentation suggests speedups of up to 50%, though real-world gains vary with workload. Even the conservative 25% average is a substantial improvement delivered to every CPython user without requiring a single line of code change.
PEP 744: The Copy-and-Patch JIT Compiler
The specializing adaptive interpreter laid the groundwork for something even more ambitious: a just-in-time compiler. PEP 744 — JIT Compilation, authored by Brandt Bucher and Savannah Ostrowski, describes the experimental JIT that was merged into CPython’s main development branch in early 2024.
The JIT uses a technique called copy-and-patch compilation, originally described in a 2021 academic paper. Here is the idea in simplified terms: at CPython’s build time, LLVM compiles each micro-op instruction into a blob of machine code with “holes” left for runtime values. At execution time, when CPython identifies a hot code path, it “copies” the appropriate machine code blobs and “patches” in the runtime-specific values. The result is native machine code, assembled without the overhead of running a full compiler at runtime.
As PEP 744 states, copy-and-patch allows a high-quality template JIT compiler to be generated from the same DSL used to generate the rest of the interpreter — a particularly valuable property for a volunteer-driven project like CPython.
The execution pipeline now looks like this:
Source Code
|
v
[Tier 0] Standard Bytecode
| (after repeated execution)
v
[Tier 1] Specialized Bytecode (PEP 659)
| (hot path detected)
v
[Tier 2] Micro-ops (optimized)
| (if JIT enabled)
v
[JIT] Native Machine Code (copy-and-patch)
The JIT remains experimental and is not enabled by default. Progress has been honest but cautious. Ken Jin, a CPython core developer working on the JIT optimizer, acknowledged in a July 2025 analysis on devclass.com that after over two years of work, JIT performance ranges from slower than the interpreter to roughly equivalent, depending on the compiler used and the workload. Python 3.15 alphas show modest geometric-mean improvements of roughly 3–8% depending on the platform and compiler.
A significant blow came in May 2025 when Microsoft laid off members of its Faster CPython team. At PyCon US 2025 in Pittsburgh, Brandt Bucher delivered his talk on the JIT just days after the layoffs. Despite the setback, Bucher made clear publicly that he intends to continue the work as a community-driven project.
Because the JIT is not enabled by default, you must build CPython with --enable-experimental-jit or use an official binary that includes it (macOS and Windows release binaries for 3.14+ include it experimentally) to test it. Do not assume JIT-related performance improvements without benchmarking your specific workload.
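If you want to check JIT status at runtime, Python 3.14 added an introspection namespace, sys._jit. Note the leading underscore: this is a semi-private API that may change, so the sketch below guards every access:

```python
import sys

# sys._jit exists only on Python 3.14+; treat its absence as "no JIT".
jit = getattr(sys, "_jit", None)

if jit is not None:
    print(f"JIT available in this build: {jit.is_available()}")
    print(f"JIT enabled for this process: {jit.is_enabled()}")
else:
    print("No sys._jit on this Python (pre-3.14 or different implementation)")
```

Pair this check with before-and-after timings of your own hot loops; as the section above notes, JIT gains are workload-dependent and should never be assumed.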
CPython’s Memory Model: Reference Counting and Garbage Collection
Understanding CPython’s memory management is essential for writing efficient Python. CPython uses a hybrid approach: primary reference counting supplemented by a cyclic garbage collector.
Every Python object in CPython contains a reference count:
import sys
a = [] # Reference count: 1 (variable 'a')
b = a # Reference count: 2 (variables 'a' and 'b')
del a # Reference count: 1 (only 'b' remains)
print(sys.getrefcount(b)) # Prints 2 (b + temporary reference from getrefcount)
del b # Reference count: 0 -> immediately deallocated
This immediate deallocation is a feature, not just an implementation detail. It makes CPython’s memory behavior more predictable than tracing garbage collectors used by Java or Go, which may delay collection. Files get closed when their objects go out of scope (though you should still use context managers), and memory is freed promptly.
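You can observe this determinism directly: on CPython, __del__ runs at the exact moment the last reference disappears, not at some later collection pass. A small sketch:

```python
events = []

class Tracked:
    def __del__(self):
        # Called synchronously when the refcount hits zero.
        events.append("deallocated")

obj = Tracked()
events.append("before del")
del obj  # last reference gone -> __del__ runs right here
events.append("after del")

print(events)  # ['before del', 'deallocated', 'after del']
```

A tracing collector could legally defer that "deallocated" entry indefinitely; CPython's refcounting makes the ordering predictable (though relying on __del__ for critical cleanup is still discouraged in favor of context managers).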
The cyclic garbage collector handles the one case reference counting cannot: circular references.
# This creates a reference cycle
a = []
b = []
a.append(b)
b.append(a)
del a, b
# Reference counts never reach zero because a references b
# and b references a. The cyclic GC detects and collects these.
You can interact with the garbage collector directly:
import gc
# Force a collection
gc.collect()
# Check garbage collection stats
print(gc.get_stats())
# Disable automatic collection (for performance-critical sections)
gc.disable()
# ... do work ...
gc.enable()
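Here is the cyclic collector catching a cycle in the act: gc.collect() returns the number of unreachable objects it found, so a deliberately built cycle shows up in the count:

```python
import gc

class Node:
    def __init__(self):
        self.partner = None

a, b = Node(), Node()
a.partner = b
b.partner = a
del a, b  # refcounts never reach zero: each Node still holds the other

# The cyclic GC walks the object graph, finds the unreachable pair
# (plus their instance dicts), and frees them; collect() reports the count.
collected = gc.collect()
print(f"collected {collected} unreachable objects")
```

The returned count includes the Nodes' __dict__ objects as well, so it is larger than two; the important point is that the cycle was found and freed.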
CPython vs. Alternative Implementations: When to Choose What
CPython is the default choice for good reasons: maximum compatibility with C extensions, the largest ecosystem of packages, and predictable behavior. But alternative implementations serve important niches.
PyPy uses a tracing JIT compiler and can be dramatically faster for long-running, CPU-bound pure Python code. If your workload is computation-heavy and you do not depend on C extensions that are incompatible with PyPy, it is worth benchmarking.
Cython is not a separate interpreter but a compiler that translates Python-like code into C extensions that run on CPython. It lets you add static type declarations to get C-like performance while staying in the CPython ecosystem:
# example.pyx -- Cython code
def fast_sum(int n):
    cdef int i
    cdef long total = 0
    for i in range(n):
        total += i * i
    return total
Mojo is a newer language described as almost a superset of Python that compiles to native code, targeting AI/ML workloads where Python’s performance overhead is a bottleneck.
The key question is always: does your workload justify leaving the CPython ecosystem, where nearly every Python package is guaranteed to work?
What Is Coming: Python 3.14, 3.15, and Beyond
Python 3.14, released in October 2025, brought several important CPython changes. The free-threaded build is now officially supported under Phase II of PEP 703. The concurrent.interpreters module makes subinterpreters — isolated copies of the Python runtime that can run in parallel within the same process — accessible from Python code for the first time, after being C-API-only for over 20 years. This is a significant concurrency model that sidesteps the GIL entirely. PEP 768 also landed in 3.14, adding a safe, zero-overhead external debugger interface that allows tools like pdb to attach to a running Python process by PID without restarting it — similar to how gdb -p works for C programs.
The PEP 768 debugger interface is already available in Python 3.14. You can attach pdb to a live process with python -m pdb -p <pid> — no code modification, no process restart needed. This is especially useful for diagnosing deadlocks or unexpected behavior in production services.
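The concurrent.interpreters module mentioned above can be sketched roughly as follows, per the PEP 734 API; the import is guarded because the module only exists on Python 3.14+:

```python
import sys

# concurrent.interpreters is new in Python 3.14 (PEP 734).
try:
    from concurrent import interpreters
except ImportError:
    interpreters = None

if interpreters is not None:
    interp = interpreters.create()  # a fresh, isolated interpreter
    interp.exec("x = 21 * 2")       # runs there, not in this interpreter
    interp.close()
    status = "ran code in a subinterpreter"
else:
    status = (
        "requires Python 3.14+; this is "
        f"{sys.version_info.major}.{sys.version_info.minor}"
    )

print(status)
```

Each subinterpreter has its own GIL, so code running in different interpreters can execute in parallel on separate cores without the free-threaded build.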
Python 3.15, currently in alpha development with a stable release expected in October 2026, brings the Tachyon statistical sampling profiler under profiling.sampling — a new low-overhead profiler that samples call stacks periodically rather than tracing every call deterministically, making it suitable for production environments. PEP 799, still in discussion, proposes reorganizing CPython’s profiling tools into a single profiling namespace (moving cProfile to profiling.tracing and Tachyon to profiling.sampling) and deprecating the legacy profile module. Ongoing work also continues on the JIT compiler and free-threading ecosystem adoption.
At the Python Language Summit 2025, van Rossum delivered a lightning talk reflecting on how Python’s development has changed. He recalled that early Python was governed by the “worse is better” philosophy — shipping imperfect features to get community feedback — and questioned whether that approach can survive in an era of features that take years to produce from teams of software developers paid by large technology companies.
Practical Takeaways
Understanding CPython is not merely theoretical. Here are concrete ways this knowledge improves your daily Python work.
Profile before optimizing. Use dis.dis() with the adaptive=True flag to see whether your hot loops are getting specialized. If a BINARY_OP is not specializing, your types may be inconsistent — stabilizing them can yield free speedups.
Use sys.getrefcount() and gc for debugging memory leaks. If your application’s memory grows over time, circular references are a common cause. The gc module’s get_referrers() and get_referents() functions help trace object graphs.
Understand the GIL for your concurrency model. For I/O-bound work, threading is fine because the GIL is released during I/O. For CPU-bound parallelism on standard CPython, use multiprocessing or concurrent.futures.ProcessPoolExecutor. On free-threaded CPython 3.14+, true thread-based CPU parallelism becomes viable.
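A minimal sketch of the process-pool route for CPU-bound work — each worker process has its own interpreter and its own GIL, so the work genuinely runs in parallel:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound_work(n):
    """Pure Python CPU-bound task (same shape as the GIL demo above)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # Two workers, two chunks of work: these run on separate cores.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(cpu_bound_work, [100_000, 100_000]))
    print(results[0] == results[1])
```

The __main__ guard is required on platforms that spawn workers by re-importing the module; the trade-off versus threads is the pickling cost of shipping arguments and results between processes.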
Stay current with CPython releases. Each annual release since 3.11 has delivered meaningful performance improvements that require zero code changes. Simply upgrading your Python version is one of the easiest optimizations available.
# Quick check: what CPython version and implementation are you running?
import sys
import platform
print(f"Implementation: {platform.python_implementation()}")
print(f"Version: {sys.version}")
print(f"Compiler: {platform.python_compiler()}")
CPython is not just “the thing that runs Python.” It is a sophisticated, evolving runtime that reflects decades of design decisions, community debate, and engineering trade-offs. Understanding how it compiles, specializes, and executes your code makes you a better Python developer — not because you need to micro-optimize every line, but because understanding your tools is how you build real mastery.
The engine is getting faster. The GIL is becoming optional. A JIT compiler is taking shape. And the language that started as a Christmas break hobby project now powers everything from AI research to web infrastructure. The story of CPython is still being written — and every Python developer is part of it.