PEP 703: Free Threading in Python -- The End of the GIL, Explained with Real Code

How a 30-year-old lock is finally becoming optional, what that means for your Python programs, and the engineering that made it possible.

For three decades, a single lock buried inside CPython has prevented Python programs from truly using more than one CPU core at a time. That lock -- the Global Interpreter Lock, or GIL -- has been the subject of more debates, failed experiments, and frustrated Stack Overflow answers than perhaps any other feature in the language's history.

PEP 703, authored by Sam Gross (a software engineer at Meta who also works on PyTorch) and sponsored by core developer Lukasz Langa, finally changes that. The Python Steering Council announced its intent to accept the PEP on July 28, 2023, and formally accepted it that October, making the GIL optional. Python 3.13 shipped the first experimental free-threaded build in October 2024. Python 3.14, released in October 2025, elevated free threading to officially supported status under PEP 779.

What the GIL Actually Does

CPython manages memory through reference counting. Every Python object carries an integer that tracks how many variables currently point to it. When that count drops to zero, the object is deallocated immediately.

Here's the problem: incrementing and decrementing an integer is not inherently thread-safe. If two threads modify the same object's reference count simultaneously without coordination, one update can overwrite the other. The count drifts from reality. Objects get freed while still in use (crash), or never get freed at all (memory leak).
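
Reference counting is directly observable from Python via sys.getrefcount (note that the call itself temporarily adds one reference to its argument):

```python
import sys

x = []                      # a fresh object; x is its only reference
n = sys.getrefcount(x)      # baseline (includes the call's own temporary ref)

a = x                       # a second name now points to the same list
print(sys.getrefcount(x) - n)   # → 1: the count went up by one

del a                       # dropping the name decrements the count
print(sys.getrefcount(x) - n)   # → 0: back to the baseline
```

Every one of those increments and decrements is exactly the operation that must not be interleaved unsafely between threads.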

The GIL solves this by brute force. It's a mutex -- a mutual exclusion lock -- that ensures only one thread executes Python bytecode at any given time. Your Python program can spawn 50 threads, but only one of them can run Python code at any given moment. The others wait.

import threading
import time

def cpu_work():
    """Pure CPU-bound task."""
    total = 0
    for i in range(20_000_000):
        total += i
    return total

# Sequential: run task twice, one after the other
start = time.perf_counter()
cpu_work()
cpu_work()
sequential_time = time.perf_counter() - start
print(f"Sequential: {sequential_time:.2f}s")

# Threaded: run task in two threads simultaneously
start = time.perf_counter()
t1 = threading.Thread(target=cpu_work)
t2 = threading.Thread(target=cpu_work)
t1.start(); t2.start()
t1.join(); t2.join()
threaded_time = time.perf_counter() - start
print(f"Threaded:   {threaded_time:.2f}s")

On standard CPython with the GIL, the threaded version is roughly the same speed as the sequential version -- or sometimes slower due to context-switching overhead. Two threads competing for a single lock don't run in parallel. They take turns.

This has been the reality since 1992, when Guido van Rossum first added threading support to Python. As he wrote in a 2007 blog post on artima.com: "I'd welcome a set of patches into Py3k only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease."

That constraint -- don't make single-threaded code slower -- is what defeated every previous attempt at GIL removal.

Three Decades of Failed Attempts

1996: Greg Stein's patch. The PEP 703 text documents this as the earliest known attempt: "In 1996, Greg Stein published a patch against Python 1.4 that removed the GIL." It worked. But after benchmarking, it slowed down single-threaded execution nearly two-fold. Two CPUs got you slightly more work than one CPU with the GIL. Not enough.

2015: Larry Hastings' Gilectomy. Presented at PyCon 2016, the Gilectomy replaced the GIL with fine-grained locks and explored various approaches to safe reference counting. The project showed promise but broke the existing CPython API, and it achieved rough performance parity only when using around seven CPU cores -- meaning single-threaded code was substantially slower.

2021: Sam Gross's nogil fork. This is where the story changed. Gross, working at Meta, posted to the python-dev mailing list in October 2021 with a proof-of-concept fork of Python 3.9. Lukasz Langa summarized the results: the no-GIL interpreter was 10% faster than 3.9 on the pyperformance benchmark suite. The estimated cost of the GIL removal itself was around 9%, but Gross had bundled in other optimizations that more than offset it.

That proof of concept became PEP 703.

How PEP 703 Works: The Engineering Under the Hood

PEP 703 isn't one clever trick. It's a coordinated set of changes to CPython's internals that, together, make thread-safe execution possible without destroying single-threaded performance. Here are the four pillars.

1. Biased Reference Counting

This is the core innovation. Standard atomic reference counting -- the approach Greg Stein used in 1996 -- requires expensive CPU operations (atomic increments/decrements) on every single reference count change. Since reference counts change constantly in Python, this creates enormous overhead.

Biased reference counting is built on an observation: the vast majority of objects in a Python program are only accessed by the thread that created them. So the system maintains two counts per object: a fast, non-atomic local count for the owning thread, and a slower atomic count for all other threads. Only the cross-thread count requires expensive atomic operations.

Object: some_list

  Owning thread (#1):
    local_refcount = 3      # Fast, non-atomic operations

  Other threads:
    shared_refcount = 1     # Slower, atomic operations

  Total refcount = local + shared = 4

When the owning thread's local count drops to zero, the system merges the counts and checks if the object can be freed. This means the common case -- creating and using an object within a single thread -- is nearly as fast as it was with the GIL.
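
A toy Python model of the idea -- not CPython's actual implementation, which lives in C at the object-header level and uses real atomic instructions rather than a lock:

```python
import threading

class BiasedRef:
    """Toy model of biased reference counting (illustration only)."""

    def __init__(self):
        self.owner = threading.get_ident()  # thread that created the object
        self.local = 1            # owner-only count: plain, unsynchronized
        self.shared = 0           # cross-thread count: synchronized updates
        self._lock = threading.Lock()  # stand-in for atomic instructions

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1       # fast path: no synchronization
        else:
            with self._lock:      # slow path: "atomic" update
                self.shared += 1

    def total(self):
        # The merged count is what decides whether the object can be freed.
        return self.local + self.shared

ref = BiasedRef()
ref.incref()                      # owner thread takes the fast path
t = threading.Thread(target=ref.incref)
t.start(); t.join()               # another thread takes the slow path
print(ref.total())  # → 3
```

The payoff is that the overwhelmingly common case -- the owner thread touching its own object -- never pays for synchronization at all.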

2. Replacing pymalloc with mimalloc

CPython's internal memory allocator, pymalloc, was never designed for thread safety. PEP 703 replaces it with mimalloc, a general-purpose allocator developed at Microsoft Research.

mimalloc does more than just allocate memory safely across threads. Its internal heap structure allows the garbage collector to find all Python objects without maintaining a separate linked list (which would require its own locking). Its size-class-based allocation also enables lock-free read operations on dictionaries -- a critical optimization since Python uses dictionaries internally for almost everything, from module namespaces to object attributes.

3. Thread-Safe Collections

Under the GIL, Python's built-in containers (lists, dicts, sets) didn't need their own locks. The GIL protected them. Without the GIL, every container operation that modifies shared state must be explicitly safe.

PEP 703 introduces per-object locks on mutable containers and uses lock-free algorithms for read-only access patterns. The PEP text describes the approach for dictionaries: load the version counter, load the backing array, load the item, increment its reference count (if non-zero), verify the item is still there, verify the version counter hasn't changed. If any verification fails, retry.
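
A toy Python sketch of that optimistic read loop (illustration only; CPython implements it with atomic operations in C, and the real version also performs the reference-count step on the loaded item):

```python
import threading

class VersionedDict:
    """Toy sketch of an optimistic, retry-on-conflict dictionary read."""

    def __init__(self):
        self._items = {}
        self._version = 0
        self._lock = threading.Lock()   # writers serialize; readers do not

    def put(self, key, value):
        with self._lock:
            self._items[key] = value
            self._version += 1          # bump the version on every mutation

    def get(self, key):
        while True:                     # lock-free read: retry on conflict
            v1 = self._version          # 1. load the version counter
            value = self._items.get(key)   # 2. load the item
            if self._version == v1:     # 3. verify no writer interleaved
                return value            # consistent snapshot: done
            # a mutation happened mid-read; try again

d = VersionedDict()
d.put("gil", "optional")
print(d.get("gil"))  # → optional
```

Readers pay nothing in the common case; only a read that races with a writer loops around and retries.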

4. Deferred Reference Counting and Immortal Objects

Some objects -- None, True, False, small integers, interned strings -- are referenced so frequently and from so many threads that even biased reference counting would create contention. PEP 703 makes these objects "immortal": their reference counts are never modified, eliminating the contention entirely.

Other frequently accessed objects (like top-level module functions and code objects) use deferred reference counting, where the interpreter avoids modifying their reference counts during normal execution and instead relies on the garbage collector for cleanup.
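
You can observe immortality directly. Since CPython 3.12 (PEP 683, which PEP 703 builds on), immortal objects report a fixed, saturated reference count that ordinary reference operations never really change:

```python
import sys

# On CPython 3.12+, immortal objects like None carry a fixed, very large
# reference count (e.g. 4294967295 on 64-bit builds).
print(sys.getrefcount(None))

before = sys.getrefcount(None)
refs = [None] * 1000           # a thousand new references to None...
print(sys.getrefcount(None) == before)  # → True on 3.12+: count unchanged
```

Because the count never moves, threads never contend on it.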

What Free Threading Looks Like in Practice

Installing the Free-Threaded Build

Starting with Python 3.13, the official macOS and Windows installers offer a free-threaded binary, installed as a separate executable with a t suffix (python3.14t for Python 3.14 -- the t stands for threading). On Linux, you can compile from source with --disable-gil, or use pyenv:

# Install free-threaded Python via pyenv
pyenv install 3.14.0t

# Or verify your build
python3.14t -c "import sys; print(sys._is_gil_enabled())"
# Output: False

You can also control the GIL at runtime:

# Disable GIL via environment variable
PYTHON_GIL=0 python3.14t my_script.py

# Or via command-line flag
python3.14t -X gil=0 my_script.py

The Benchmark That Matters

Here's a CPU-bound workload that demonstrates the difference:

import threading
import time
import sys

def fibonacci(n):
    """Deliberately naive recursive Fibonacci."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def benchmark(func, arg, n_threads):
    threads = [threading.Thread(target=func, args=(arg,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return elapsed

# Single-threaded baseline
single = benchmark(fibonacci, 32, 1)
print(f"1 thread:  {single:.3f}s")

# Multi-threaded
for n in [2, 4, 8]:
    multi = benchmark(fibonacci, 32, n)
    # Effective speedup: total work (n runs) divided by elapsed time,
    # relative to the single-threaded baseline. Perfect scaling gives n.
    speedup = (single * n) / multi
    print(f"{n} threads: {multi:.3f}s  "
          f"(effective speedup: {speedup:.1f}x)")

print(f"\nGIL enabled: {sys._is_gil_enabled()}")
Note

On standard Python (GIL enabled), four threads running fibonacci(32) take roughly 4x as long as one thread -- no parallelism at all. On the free-threaded build, four threads complete in roughly the same time as one thread, because all four CPU cores are doing real work simultaneously. The official Python 3.14 documentation confirms the single-threaded performance penalty is now roughly 5-10%, depending on the platform and C compiler used.

What Changes for Your Code

Code That "Just Works"

Pure Python code that doesn't share mutable state between threads works with no changes. Each thread operates on its own data, and the free-threaded interpreter handles the rest.

import threading

def process_chunk(data_chunk):
    """Each thread processes its own chunk. No shared state."""
    return [x ** 2 + x for x in data_chunk]

data = list(range(1_000_000))
chunk_size = len(data) // 4
chunks = [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

results = [None] * 4
def worker(idx):
    results[idx] = process_chunk(chunks[idx])

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Code That Needs Attention

If your code relied on the GIL to protect shared mutable state, you now need explicit synchronization:

import threading

# BEFORE: This was "safe" only because the GIL serialized access
shared_counter = 0

def increment():
    global shared_counter
    for _ in range(1_000_000):
        shared_counter += 1  # NOT SAFE without GIL

# AFTER: Explicit synchronization required
lock = threading.Lock()
shared_counter = 0

def increment_safe():
    global shared_counter
    for _ in range(1_000_000):
        with lock:
            shared_counter += 1
Warning

This isn't new discipline -- it's how concurrent programming works in every other language. The GIL had been masking the need for it. If a C extension isn't explicitly marked as supporting free threading, the interpreter will automatically re-enable the GIL when importing it, falling back to single-threaded behavior.

C Extensions: The Ecosystem Challenge

This is the biggest practical hurdle. C extensions that assumed the GIL was always present may need updates. The PEP introduces a Py_mod_gil slot that extension modules use to declare their threading support:

static struct PyModuleDef_Slot module_slots[] = {
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},  // "I'm thread-safe"
    {0, NULL}
};

As of late 2025, major packages including NumPy have published pre-compiled wheels for the free-threaded build. The community tracks package compatibility at py-free-threading.github.io.

Why This Matters: The ML and Scientific Computing Case

PEP 703's motivation section draws heavily from scientific computing and machine learning. The GIL isn't a problem for the heavy number-crunching that happens inside NumPy or PyTorch (that runs in C/CUDA, releasing the GIL). It's a problem for all the orchestration that happens in Python around those computations.

"We frequently battle issues with the Python GIL... we often see that even with fewer than 10 threads the GIL becomes the bottleneck." — Researchers at DeepMind, cited in PEP 703

Allen Goodman from Genentech's Prescient Design group described the impact of the nogil prototype on their work: it took a single person less than half a working day to adjust the codebase to use the fork, and the results were immediate.

The Three-Phase Rollout

Phase I -- Python 3.13 (Oct 2024), experimental: free-threaded build available, not the default, clearly labeled experimental.

Phase II -- Python 3.14 (Oct 2025), officially supported: no longer experimental per PEP 779, still optional; single-threaded overhead within 15%, memory overhead within 20%.

Phase III -- release TBD, not yet decided: free threading becomes the default build, and eventually the only build.
Pro Tip

The Steering Council's original acceptance statement emphasized that the rollout be gradual and break as little as possible, and that they can roll back any changes that turn out to be too disruptive -- including potentially rolling back all of PEP 703 entirely if necessary.

Multiprocessing vs. Free Threading

Experienced Python developers might wonder: didn't multiprocessing already solve this? It did, partially. The multiprocessing module sidesteps the GIL entirely by spawning separate processes, each with its own interpreter and its own GIL.

But processes are heavy. Each one carries a full copy of the Python interpreter -- typically tens of megabytes of memory before your application's libraries are even imported. Sharing data between processes requires serialization -- converting Python objects to bytes and back -- which adds latency and CPU overhead. For workloads where the data being shared is small and the computation is large, multiprocessing works well. For workloads with frequent data sharing, fine-grained parallelism, or low-latency requirements, the serialization overhead dominates.
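
You can get a rough feel for that serialization tax by timing a pickle round-trip of a payload -- essentially what multiprocessing does when it ships arguments and results between processes (exact numbers vary by machine):

```python
import pickle
import time

data = list(range(1_000_000))      # a payload you might hand to a worker

start = time.perf_counter()
blob = pickle.dumps(data)          # serialized on send...
restored = pickle.loads(blob)      # ...deserialized on receive
elapsed = time.perf_counter() - start

print(f"round-trip: {elapsed * 1000:.1f} ms for {len(blob) / 1e6:.1f} MB")
```

Threads sharing one address space skip this step entirely.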

# Multiprocessing: works, but each worker is a separate process
from multiprocessing import Pool

def expensive_calculation(x):
    return sum(i * i for i in range(x))

if __name__ == "__main__":  # required on platforms that spawn workers
    with Pool(4) as pool:
        results = pool.map(expensive_calculation, [10_000_000] * 4)

# Free threading: same parallelism, shared memory, lower overhead
import threading

results = [None] * 4
def worker(idx, x):
    results[idx] = sum(i * i for i in range(x))

threads = [threading.Thread(target=worker, args=(i, 10_000_000))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

The free-threaded approach avoids the serialization overhead entirely. Threads share the same memory space, so passing data between them is essentially free.

Should You Switch Today?

Switch now if you have CPU-bound Python code that would clearly benefit from parallelism, you can test thoroughly, and your dependencies support the free-threaded build.

Wait if your workload is primarily I/O-bound (use asyncio or standard threading instead), you depend on C extensions that haven't been updated, or you can't tolerate the 5-10% single-threaded overhead.

Experiment regardless. Install the free-threaded build alongside your standard Python. Run your test suite. Profile your hot paths. The overhead will continue to shrink with each release, and the ecosystem will continue to catch up.

# Install both side by side
pyenv install 3.14.0
pyenv install 3.14.0t

# Run your tests on both
pyenv shell 3.14.0
python -m pytest tests/

pyenv shell 3.14.0t
python -m pytest tests/

Key Takeaways

  1. The GIL is now optional: PEP 703 makes the Global Interpreter Lock optional in CPython. Python 3.14 ships a free-threaded build that is officially supported under PEP 779.
  2. Biased reference counting is the core innovation: By maintaining separate local and shared reference counts per object, the free-threaded build keeps the common case (single-thread access) fast while making cross-thread access safe.
  3. Single-threaded overhead is 5-10%: Down from approximately 40% in the Python 3.13 experimental build, the performance penalty continues to shrink.
  4. Code without shared mutable state works immediately: Pure Python code where each thread operates on its own data needs no changes. Code that relied on the GIL for implicit synchronization now needs explicit locks.
  5. C extensions need to opt in: Extensions that haven't declared free-threading support will trigger GIL re-enablement automatically, preserving backward compatibility.

PEP 703 isn't just a performance feature. It's a statement about where Python is headed. By making the GIL optional and charting a path toward its eventual removal, the Python core team is ensuring the language can grow with its community rather than constraining it. The 30-year-old lock isn't gone yet. But for the first time, it's genuinely, practically, optional. And for CPU-bound Python code that can use threads, the difference is immediate and measurable: real work, on real cores, in real parallel.

All code examples tested against Python 3.14 (standard and free-threaded builds). PEP references link to the official documents at peps.python.org. Package compatibility tracking available at py-free-threading.github.io.