Every Python object -- every integer, every string, every list -- lives in memory that was allocated, tracked, and will eventually be reclaimed. Python hides this complexity behind clean syntax, but the machinery underneath is sophisticated, opinionated, and has real consequences for how your code performs. Understanding that machinery is the difference between writing code that works and writing code that works well.
This article covers how CPython actually manages memory: the allocator hierarchy, reference counting, the generational garbage collector, the PEPs that shaped the system, object interning strategies, production profiling tools, and the ongoing free-threading transition. Real code, real internals, verifiable sources, no hand-waving.
Everything Is an Object, and Every Object Is a C Struct
Before anything else, internalize this: in CPython, every Python object is a C struct that starts with a reference count and a type pointer. When you write x = 42, CPython doesn't just store the number 42 somewhere. It allocates a PyObject (specifically a PyLongObject) on the heap, sets its reference count to 1, and points x at it.
You can see this machinery directly:
import sys
x = "hello"
print(sys.getrefcount(x)) # Likely 2 (x + the argument to getrefcount)
print(sys.getsizeof(x)) # e.g. 54 bytes on a 64-bit build (exact size varies by version)
y = x
print(sys.getrefcount(x)) # Now 3 (x, y, and the argument)
# Even "simple" objects have weight
# (Sizes shown here are from a 64-bit CPython build; exact values
# differ slightly across versions and architectures.)
print(sys.getsizeof(42)) # 28 bytes
print(sys.getsizeof([])) # 56 bytes (empty list!)
print(sys.getsizeof({})) # 64 bytes (empty dict!)
print(sys.getsizeof("")) # 49 bytes (empty string!)
That empty string consuming 49 bytes is not a bug. It's the overhead of being a full Python object: reference count, type pointer, hash cache, length, and the string's internal representation. This per-object overhead is a core characteristic of CPython, and it directly motivates the custom allocator that sits underneath everything.
The Three-Layer Allocator: Arenas, Pools, and Blocks
CPython doesn't call the operating system's malloc for every object. Doing so would be catastrophically slow -- Python programs create and destroy enormous numbers of small objects, and every malloc/free pair carries bookkeeping overhead (plus occasional system-call cost when the heap must grow). Instead, CPython uses a three-tier custom allocator called pymalloc, which has been the default allocator for object allocation (PyObject_Malloc) since Python 2.3 and for general memory allocation (PyMem_Malloc) since Python 3.6.
The CPython source code describes pymalloc as a specialized allocator optimized for small blocks that sits on top of the system's general-purpose malloc. The allocator handles all requests for objects up to 512 bytes. Anything larger gets routed directly to the system's malloc.
The hierarchy works like this:
Arenas are the top level. When pymalloc needs fresh memory from the OS, it requests it in large chunks called arenas. On 64-bit platforms (which cover virtually all modern systems), arenas are 1 MiB; on 32-bit platforms, they are 256 KiB. Arenas are the only unit of memory that pymalloc allocates from (and can return to) the operating system. This distinction matters: older documentation and tutorials that cite 256 KiB unconditionally predate the 64-bit arena expansion that arrived during the Python 3.10 development cycle as part of the radix tree rework of obmalloc (see the Python memory management documentation and CPython PR #14474).
Pools subdivide arenas. Each arena is divided into 4 KiB pools (matching the typical OS page size). Each pool is dedicated to blocks of a single size class.
Blocks are the actual units handed out to Python objects. Block sizes are multiples of the platform alignment: 16 bytes on 64-bit systems (since Python 3.8) or 8 bytes on 32-bit systems, up to a maximum of 512 bytes. When you create a small object, pymalloc rounds up to the appropriate size class and returns a block from a pool dedicated to that size.
Here's the size class table for a 64-bit platform:
# Request size -> Block size -> Size class index (64-bit)
# 1-16 bytes -> 16 bytes -> 0
# 17-32 bytes -> 32 bytes -> 1
# 33-48 bytes -> 48 bytes -> 2
# 49-64 bytes -> 64 bytes -> 3
# ...
# 497-512 -> 512 bytes -> 31
This design is fast because the common case -- "give me a block from a pool that already has free blocks of the right size" -- requires only a couple of pointer operations. CPython maintains a usedpools array that maps each size class to a linked list of pools with available blocks. Grabbing a block off that list is essentially constant time.
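The mapping in the table above can be modeled in a few lines. This is an illustrative sketch of the 64-bit rounding rule, not CPython's actual C implementation (the function name size_class is my own):

```python
ALIGNMENT = 16                  # 64-bit platforms, Python 3.8+
SMALL_REQUEST_THRESHOLD = 512   # pymalloc's upper limit

def size_class(nbytes: int):
    """Model of pymalloc's size-class rounding on a 64-bit platform.

    Returns (block_size, size_class_index), or None when the request
    bypasses pymalloc and goes to the system malloc instead.
    """
    if nbytes == 0 or nbytes > SMALL_REQUEST_THRESHOLD:
        return None
    # Round the request up to the next multiple of the alignment.
    block = (nbytes + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT
    return block, block // ALIGNMENT - 1

print(size_class(1))     # (16, 0)
print(size_class(17))    # (32, 1)
print(size_class(512))   # (512, 31)
print(size_class(513))   # None -> system malloc
```

The real allocator then uses that index to pick a pool from the usedpools array, which is where the constant-time behavior comes from.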
You can observe pymalloc's behavior using sys._debugmallocstats():
import sys
sys._debugmallocstats()
This prints detailed statistics about arena utilization, pool allocation counts per size class, and the number of allocated versus free blocks. It's an underused diagnostic tool.
The Arena Problem: Memory That Doesn't Come Back
There's a well-known consequence of this architecture: pymalloc can hold onto memory even after your objects are freed. A block can only be returned to its pool when the object using it is deallocated. A pool can only be marked empty when all its blocks are free. And an arena can only be returned to the OS when every pool in it is empty.
If even one small object in an arena is still alive, the entire arena stays allocated -- that's 1 MiB on a 64-bit system or 256 KiB on a 32-bit system. This is why Python processes sometimes appear to "leak" memory: the OS-level memory usage doesn't shrink after a burst of allocations, even though Python has freed the objects internally. The memory is available for reuse by pymalloc, but it's not returned to the operating system.
Reference Counting: Python's Primary Memory Manager
Reference counting is the workhorse of CPython's memory management. It's not optional, not configurable, and it handles the vast majority of object deallocation. Every PyObject in CPython carries an ob_refcnt field. When a new reference to the object is created, the count increments. When a reference is removed, it decrements. When the count hits zero, the object is deallocated immediately.
import sys
a = [1, 2, 3]
print(sys.getrefcount(a)) # 2 (a + getrefcount arg)
b = a
print(sys.getrefcount(a)) # 3 (a, b, + getrefcount arg)
c = [a, a, a]
print(sys.getrefcount(a)) # 6 (a, b, three list entries, + getrefcount arg)
del b
print(sys.getrefcount(a)) # 5
c.clear()
print(sys.getrefcount(a)) # 2 (back to a + getrefcount arg)
Reference counting has a property that makes it compelling: deterministic destruction. The moment the last reference to an object disappears, that object is freed. No waiting, no background process, no pauses. This is why file handles, database connections, and other resources often get cleaned up reliably in CPython even without explicit close() calls or context managers -- though you should still use them for portability.
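A minimal demonstration of that determinism. This behavior is CPython-specific (PyPy and other implementations may defer finalization to a GC pass):

```python
class Resource:
    destroyed = []   # class-level log of finalized instances

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Under reference counting this runs the instant the last
        # reference disappears -- no GC pause, no background thread.
        Resource.destroyed.append(self.name)

r = Resource("db-connection")
print(Resource.destroyed)   # []

del r                       # refcount hits zero: __del__ runs immediately
print(Resource.destroyed)   # ['db-connection']
```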
But reference counting has a fatal flaw: it cannot detect reference cycles.
# This creates a cycle that reference counting alone cannot free
def create_cycle():
    a = {}
    b = {}
    a["ref"] = b
    b["ref"] = a
    # When create_cycle returns, a and b go out of scope
    # But each still references the other, so refcount never hits 0
After create_cycle() returns, both dictionaries are unreachable from Python code, but each holds a reference to the other. Their reference counts are both 1, not 0. Reference counting alone will never free them. This is where the garbage collector comes in.
The Generational Garbage Collector: Catching What Refcounting Misses
CPython's cyclic garbage collector exists for one specific purpose: detecting and breaking reference cycles among container objects. It does not replace reference counting -- it supplements it. As developer Artem Golubin has explained, CPython's GC handles reference cycles only, while reference counting remains fundamental and cannot be disabled (Source: rushter.com/blog/python-garbage-collector).
The GC uses a generational strategy rooted in the observation that newly created objects tend to die young. Historically, CPython organized objects into three generations (0, 1, and 2). In Python 3.14, the GC was simplified to two generations: young and old. New objects start in the youngest generation. If they survive a collection cycle, they're promoted to the older generation.
For backward compatibility, gc.get_threshold() still returns a three-item tuple, but the default is now (700, 10, 0), and the parameters have new semantics: threshold0 still triggers collection; threshold1 sets the rate at which the old generation is scanned, with higher values meaning slower scanning (the default of 10 corresponds to roughly 1% of the old generation per increment); and threshold2 is meaningless and always zero. set_threshold() now ignores any items after the second.
The gc.get_count() and gc.get_stats() functions likewise still return three-element results, but the values now refer to the young generation and to the aging and collecting spaces of the old generation rather than to three separate generations. Code that tuned GC behavior via gc.set_threshold() will still work, but the third parameter has no effect. (Sources: Python 3.14 What's New, Python 3.14 gc docs, CPython GC internals)
import gc
# Check the current collection thresholds
print(gc.get_threshold()) # Default: (700, 10, 0)
# Check how many allocations/deallocations have occurred
print(gc.get_count()) # (count0, count1, count2)
The default threshold of 700 means: when the number of allocations minus deallocations since the last collection exceeds 700, a collection of the young generation starts. In Python 3.13 and earlier, the second and third values (10, 10) controlled how often the older generations were collected relative to younger ones; in Python 3.14, as described above, the second value sets the old generation's scan rate and the third is ignored.
The cycle detection algorithm itself is clever. For each object in the generation being collected, the GC temporarily subtracts all internal references (references between objects within the same generation). Objects whose effective reference count drops to zero after this subtraction are unreachable -- they're only kept alive by each other, not by anything outside the cycle. Those objects get collected.
Only container objects -- lists, dicts, sets, tuples, instances of user-defined classes -- participate in the GC. Primitive types like int, float, and str cannot contain references to other objects, so they can never form cycles and are never tracked. This is an important optimization.
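You can check this yourself with gc.is_tracked(). The dict results in the comments reflect a further optimization documented in the gc module: dicts are tracked lazily, so one holding only atomic keys and values stays untracked until a container is inserted:

```python
import gc

# Atomic types can never reference other objects, so they are never tracked.
print(gc.is_tracked(42))          # False
print(gc.is_tracked("a string"))  # False

# Containers are tracked by default...
print(gc.is_tracked([]))          # True

# ...except dicts, which are tracked lazily: a dict of atomics
# cannot form a cycle, so it stays untracked until it holds a container.
print(gc.is_tracked({"a": 1}))    # False on recent CPython
print(gc.is_tracked({"a": []}))   # True -- now holds a container
```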
import gc
# See what the GC is tracking
gc.collect() # Force a full collection
tracked = gc.get_objects()
print(f"GC is tracking {len(tracked):,} objects")
# Demonstrate manual cycle cleanup
class Node:
    def __init__(self):
        self.ref = None

a = Node()
b = Node()
a.ref = b
b.ref = a
del a
del b
# The nodes are unreachable but still alive
print(f"Collected: {gc.collect()} objects")  # Collects the two Nodes (exact count varies by version)
PEPs That Shaped the Memory Landscape
Several Python Enhancement Proposals have fundamentally changed how memory management works in CPython. Understanding them gives you insight into both current behavior and where things are headed.
PEP 442: Safe Object Finalization (Python 3.4)
Before PEP 442, authored by Antoine Pitrou, objects with __del__ methods that were caught in reference cycles could never be collected. They'd end up in gc.garbage, a list of uncollectable objects that would grow indefinitely in long-running processes. The pre-3.4 documentation explicitly warned that finalizable objects in reference cycles were problematic (Source: PEP 442).
PEP 442 fixed this by redesigning how the GC handles finalization. Under the new scheme, the GC calls finalizers on objects in cyclic isolates before breaking the cycle, while the objects are still in a usable state. This meant that generators with finally blocks, objects with __del__ methods, and similar constructs could finally participate in cycles without causing leaks.
PEP 454: tracemalloc (Python 3.4)
PEP 454 added the tracemalloc module to the standard library. The motivation was straightforward: existing tools like Valgrind couldn't meaningfully profile Python memory allocations because they operate at the C level. Since nearly all Python object allocations funnel through the same few internal C functions, those tools produce tracebacks that point to pymalloc internals rather than identifying which Python source code is responsible for the allocation (Source: PEP 454).
The tracemalloc module solves this by tracking allocations at the Python source level -- file and line number. More on this in the practical tools section below.
PEP 683: Immortal Objects (Python 3.12)
PEP 683, authored by Eric Snow and Eddie Elizondo, introduced a fundamental change to reference counting. Under this PEP, certain objects -- None, True, False, built-in types, and small integers -- can be marked as "immortal." Their reference count is set to a special value that Py_INCREF and Py_DECREF recognize and skip.
The immediate benefit is for copy-on-write performance in forked processes. When a child process increments the reference count of None in the parent's shared memory, that triggers a copy-on-write page fault. With immortal objects, the refcount never changes, so the memory page stays shared.
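You can observe immortality from Python. On CPython 3.12+, creating a million new references to None leaves its reported reference count unchanged; on older versions, the count grows as you'd expect:

```python
import sys

before = sys.getrefcount(None)
refs = [None] * 1_000_000   # a million fresh references to None
after = sys.getrefcount(None)

print(before, after)
# On CPython 3.12+ the two numbers are equal (and enormous): Py_INCREF
# recognizes the immortal refcount and leaves it alone, which is what
# keeps the memory page shared after fork. On 3.11 and earlier, `after`
# is larger by one million.
```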
The Instagram Story: When the GC Works Against You
One of the most instructive real-world memory management case studies comes from Instagram's engineering team. In 2017, they published a post describing how they achieved a 10% global capacity improvement by disabling the cyclic garbage collector (Source: Instagram Engineering blog).
Instagram's web server runs Django in a multi-process mode using uWSGI with pre-fork. A master process forks itself to create worker processes that handle requests. Under Linux, forked processes share memory with the parent through copy-on-write (CoW): memory pages stay shared until one process writes to them, at which point the kernel copies the page.
The engineering team noticed that worker processes' memory usage spiked immediately after forking. Using perf to trace page faults, they found the culprit: the garbage collector's collect function. Every GC run traverses and reorganizes linked lists embedded in tracked objects (the PyGC_Head structure). Modifying those lists means writing to memory pages that contain those objects, which triggers copy-on-write for every affected page.
Their fix was two lines of code:
# gc.disable() doesn't work, because some random 3rd-party library will
# enable it back implicitly.
gc.set_threshold(0)
# Suicide immediately after other atexit functions finish.
# CPython will do a bunch of cleanups in Py_Finalize which
# will again cause Copy-on-Write, including a final GC
atexit.register(os._exit, 0)
The team chose gc.set_threshold(0) over gc.disable() because third-party libraries can call gc.enable() implicitly. Setting the threshold to zero effectively prevents collection while being immune to those calls.
With the GC disabled, the CPU LLC (last-level cache) hit ratio improved, translating into the 10% capacity win. However, memory growth from actual reference cycles became a problem as the codebase grew. Their follow-up work contributed the gc.freeze() function (landed in Python 3.7), which moves all tracked objects into a permanent generation that the GC ignores. This allows GC to run on new objects without touching the shared pre-fork memory (Source: Instagram Engineering, "Copy-on-write friendly Python garbage collection").
If you're running a multi-process Python server with pre-fork (gunicorn, uWSGI), call gc.freeze() in the master process before forking. Workers inherit frozen objects that the GC will never traverse, keeping those memory pages shared and avoiding unnecessary copy-on-write faults.
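A minimal sketch of that pre-fork recipe using only the gc module (the fork itself is elided; gc.get_freeze_count() lets you verify what was frozen):

```python
import gc

# In the pre-fork master process:
gc.collect()    # collect existing garbage before freezing objects in place
gc.freeze()     # move every tracked object into the permanent generation
print(f"Frozen objects: {gc.get_freeze_count()}")

# ... fork worker processes here; workers inherit the frozen objects,
# and the GC never traverses (or writes to) them ...

# If a process later wants those objects eligible for collection again:
gc.unfreeze()
print(f"Frozen objects: {gc.get_freeze_count()}")  # 0
```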
Practical Tools: Seeing Memory in Action
tracemalloc: Finding Where Memory Is Allocated
The tracemalloc module is the standard library's built-in memory profiler. It tracks allocations at the Python source level:
import tracemalloc
tracemalloc.start()
# Your code here
data = [dict(key=i, value=str(i) * 100) for i in range(10000)]
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
print("[ Top 5 memory consumers ]")
for stat in top_stats[:5]:
    print(stat)
The real power comes from comparing snapshots to find leaks:
import tracemalloc
tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()
# Simulate work that might leak
cache = {}
for i in range(10000):
    cache[i] = [0] * 100
snapshot2 = tracemalloc.take_snapshot()
# Show what grew between snapshots
top_stats = snapshot2.compare_to(snapshot1, "lineno")
print("[ Top 5 differences ]")
for stat in top_stats[:5]:
    print(stat)
# Also check peak memory
current, peak = tracemalloc.get_traced_memory()
print(f"\nCurrent: {current / 1024:.1f} KiB")
print(f"Peak: {peak / 1024:.1f} KiB")
sys.getsizeof: Measuring Individual Objects
sys.getsizeof returns the size of an individual object in bytes. But be careful -- it doesn't follow references:
import sys
# getsizeof shows the container, not the contents
lst = [1, 2, 3, 4, 5]
print(f"List object: {sys.getsizeof(lst)} bytes")
print(f"One int: {sys.getsizeof(1)} bytes")
# To get total size including contents, you need to recurse
def deep_getsizeof(obj, seen=None):
    """Recursively calculate total size of an object and its contents."""
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size
data = {"key": [1, 2, 3], "other": "hello"}
print(f"Shallow: {sys.getsizeof(data)} bytes")
print(f"Deep: {deep_getsizeof(data)} bytes")
The gc Module: Controlling Collection
import gc
# See current GC stats
print(f"Enabled: {gc.isenabled()}")
print(f"Thresholds: {gc.get_threshold()}")
print(f"Counts: {gc.get_count()}")
# Force collection and see how many objects were freed
collected = gc.collect()
print(f"Collected {collected} unreachable objects")
# Debug mode: find what the GC is collecting
gc.set_debug(gc.DEBUG_STATS)
gc.collect()
gc.set_debug(0) # Turn off debug output
# Freeze pre-fork objects (Python 3.7+, useful for multiprocess servers)
gc.freeze()
# After fork: gc.unfreeze() if you want to collect those objects later
Reducing Memory: Patterns That Work
__slots__: Eliminating Per-Instance Dictionaries
By default, every instance of a user-defined class carries a __dict__ -- a full dictionary for storing attributes. For classes with many instances, this is expensive:
import sys
class RegularPoint:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class SlottedPoint:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
regular = RegularPoint(1.0, 2.0, 3.0)
slotted = SlottedPoint(1.0, 2.0, 3.0)
print(f"Regular: {sys.getsizeof(regular)} bytes + __dict__: {sys.getsizeof(regular.__dict__)} bytes")
print(f"Slotted: {sys.getsizeof(slotted)} bytes (no __dict__)")
The savings per instance typically run 40-60%, and for classes with millions of instances (graph nodes, data records, parsed tokens), this adds up to hundreds of megabytes.
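One behavioral consequence worth knowing before adopting __slots__: with no per-instance __dict__, assigning an attribute that wasn't declared raises AttributeError instead of silently creating it:

```python
class SlottedPoint:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

p = SlottedPoint(1.0, 2.0, 3.0)
try:
    p.w = 4.0   # not declared in __slots__, and there is no __dict__
except AttributeError as exc:
    print(f"Rejected: {exc}")
```

For many codebases this strictness is a feature (it catches typos), but it's a behavior change to be aware of when retrofitting slots onto existing classes.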
Generators Instead of Lists
import tracemalloc
tracemalloc.start()
# This builds the entire list in memory
squares_list = [x ** 2 for x in range(1_000_000)]
_, peak_list = tracemalloc.get_traced_memory()
tracemalloc.reset_peak()
del squares_list
# This yields one value at a time
squares_gen = (x ** 2 for x in range(1_000_000))
# Consume it
total = sum(squares_gen)
_, peak_gen = tracemalloc.get_traced_memory()
print(f"List comprehension peak: {peak_list / 1024:.1f} KiB")
print(f"Generator expression peak: {peak_gen / 1024:.1f} KiB")
The list materializes a million integers in memory. The generator expression holds only one at a time. For processing pipelines where you don't need random access to all elements simultaneously, this is one of the most effective memory optimizations available.
Weakref: References That Don't Keep Objects Alive
When you need to reference an object without preventing its collection, weakref breaks the reference counting chain:
import weakref
class ExpensiveResource:
    def __init__(self, name, data_size):
        self.name = name
        self.data = bytearray(data_size)

    def __repr__(self):
        return f"ExpensiveResource({self.name!r})"
# Build a cache using weak references
cache = weakref.WeakValueDictionary()
def get_resource(name):
    resource = cache.get(name)
    if resource is not None:
        print(f"Cache hit: {name}")
        return resource
    print(f"Cache miss: {name}, creating new")
    resource = ExpensiveResource(name, 1_000_000)
    cache[name] = resource
    return resource
r1 = get_resource("dataset_a") # Cache miss
r2 = get_resource("dataset_a") # Cache hit
del r1, r2 # Both references gone
# The resource can now be garbage collected
# and the cache entry disappears automatically
r3 = get_resource("dataset_a") # Cache miss again
WeakValueDictionary holds weak references to its values. When the last strong reference to a value disappears, the entry is automatically removed from the dictionary. This is invaluable for caches where you don't want the cache itself to prevent garbage collection of objects that nothing else needs.
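A related tool is weakref.finalize, which registers a cleanup callback that runs when an object is collected, without keeping the object alive. Under CPython's reference counting, the callback fires deterministically when the last strong reference disappears:

```python
import weakref

class ExpensiveResource:
    def __init__(self, name):
        self.name = name

events = []
r = ExpensiveResource("dataset_a")

# The callback must not hold a reference to `r` itself, or it would
# keep the object alive; here it only captures the events list.
finalizer = weakref.finalize(r, events.append, "dataset_a finalized")
print(finalizer.alive)  # True

del r                   # last strong reference gone -> callback fires now
print(events)           # ['dataset_a finalized']
print(finalizer.alive)  # False
```

Unlike __del__, finalize callbacks behave sanely during interpreter shutdown and can't resurrect the object, which makes them the safer choice for resource cleanup hooks.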
The Free-Threading Future
Python 3.13 introduced an experimental free-threaded build (PEP 703) that removes the Global Interpreter Lock. With the acceptance of PEP 779 for Python 3.14, the free-threaded build has moved from experimental to officially supported status (phase II of the PEP 703 roadmap), though it is not yet the default interpreter build (Source: py-free-threading.github.io).
This has profound implications for memory management. The free-threaded build replaces pymalloc with mimalloc, a general-purpose thread-safe allocator, as specified in PEP 703. The reason is architectural: pymalloc was never designed for concurrent access from multiple threads without the GIL protecting it. mimalloc is designed for thread safety from the ground up, and its internal heap structure also enables the GC to find all Python objects by traversing mimalloc's own data structures rather than maintaining a separate linked list -- a significant simplification (Source: PEP 703).
In the free-threaded build, reference counting itself had to be reworked. Traditional Py_INCREF and Py_DECREF are not atomic operations -- under the GIL, they didn't need to be. Without the GIL, reference count modifications must use atomic operations to prevent data races. The free-threaded build also uses biased reference counting: the thread that created an object updates a fast local count without atomics, while all other threads update a separate shared count atomically, avoiding contention on a single counter.
The performance cost is measurable but narrowing. PEP 779 proposes a hard target of no more than 15% single-threaded performance penalty and no more than 20% higher memory usage (geometric mean, as measured by pyperformance) for the free-threaded build. According to the Python 3.14 What's New documentation, the actual single-threaded penalty is now roughly 5-10% depending on platform and compiler, well within these targets (Sources: PEP 779, Python 3.14 What's New).
This is where CPython's memory management is headed: toward a world where the allocator, the reference counting system, and the garbage collector all need to work correctly under true parallelism -- and where the trade-offs between memory overhead and thread safety become a first-class engineering concern.
Object Interning and Small Object Caching
CPython employs several caching strategies that directly affect memory behavior in ways that surprise developers who aren't expecting them.
Integer caching: CPython pre-allocates and caches all integers from -5 to 256. Every variable that holds one of these values points to the same object in memory. This is why a = 100; b = 100; a is b returns True, but a = 1000; b = 1000; a is b may return False. The cached range was chosen based on empirical analysis of common integer usage in real programs.
import sys
# Cached integers share identity
a = 256
b = 256
print(a is b) # True -- same object
print(id(a) == id(b)) # True
# Outside the cache, new objects are created
c = 257
d = 257
print(c is d) # Often False (e.g. on separate REPL lines); may be True in a
# script, where the compiler can store a single shared constant
# The memory cost of uncached integers adds up
print(sys.getsizeof(0)) # 28 bytes
print(sys.getsizeof(10**100)) # Much larger -- big integers grow dynamically
String interning: CPython automatically interns certain strings, particularly those that resemble identifiers (alphanumeric characters and underscores only). Interned strings are stored in a global table and reused, which accelerates dictionary lookups -- since interned strings can be compared by pointer identity rather than character-by-character comparison. You can explicitly intern strings using sys.intern(), which is useful for applications that create large numbers of repeated string values (parsers, data pipelines, configuration systems).
import sys
# Identifier-like strings get interned automatically
s1 = "hello_world"
s2 = "hello_world"
print(s1 is s2) # True -- automatically interned
# Strings with spaces or special characters are not interned by default
s3 = "hello world"
s4 = "hello world"
print(s3 is s4) # May be False
# Explicit interning for repeated strings in data-heavy applications
s5 = sys.intern("my_repeated_key")
s6 = sys.intern("my_repeated_key")
print(s5 is s6) # Always True
Free lists: Several built-in types maintain their own internal free lists of recently deallocated objects. When a float, tuple, or list object is freed, instead of returning the memory to pymalloc, it can be placed on a type-specific free list; the next time an object of that type is created, it's pulled from the free list rather than going through the allocator. This is fast, but memory parked on one type's free list can't be used by any other type. Each free list is bounded (the types cap how many freed objects they keep), so the retained memory is limited -- but within those caps, it's memory that dictionaries or strings will never see.
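You can sometimes observe a free list in action through id() reuse. This is an implementation detail of CPython's default build, so treat the printed result as a hint, not a guarantee; math.sqrt is used only to create floats at runtime, since float literals live in the code object's constants and are never freed:

```python
import math

# Create a float at runtime so it can actually be deallocated.
a = math.sqrt(2.0)
old_id = id(a)
del a                   # the float's slot goes onto the float free list

b = math.sqrt(3.0)      # typically pops the slot `a` just vacated
print(id(b) == old_id)  # frequently True on CPython's default build
```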
Memory Profiling in Production
The standard library tools covered earlier (tracemalloc, sys.getsizeof, gc) are invaluable for development and debugging, but production memory profiling demands additional strategies.
memray: Production-Grade Memory Profiling
memray (developed by Bloomberg's engineering team) is the current standard for production Python memory profiling. Unlike tracemalloc, it tracks all allocations -- including those made by C extensions and native libraries -- by intercepting calls at the C allocator level. It generates flame graphs, temporal allocation charts, and leak reports. One critical detail: when profiling with pymalloc active, small allocations satisfied from existing arenas won't appear in reports at all, and new arena allocations will appear as mmap requests at arena granularity rather than individual object sizes. For precise object-level leak detection, you can disable pymalloc (PYTHONMALLOC=malloc) during profiling, though this changes the allocation behavior and produces slower execution and larger report files (Source: bloomberg.github.io/memray).
# Install: pip install memray
# Profile a script
# $ python -m memray run my_script.py
# $ python -m memray flamegraph memray-my_script.py.bin
# For live monitoring in a running process:
# $ python -m memray run --live my_script.py
# For leak detection:
# $ python -m memray run my_script.py
# $ python -m memray flamegraph --leaks memray-my_script.py.bin
objgraph: Visualizing Object References
When you suspect a reference cycle but can't find it, objgraph lets you visualize the reference graph of specific objects. It's particularly useful for tracking down why a particular object isn't being collected:
# Install: pip install objgraph
import objgraph
# Show what types of objects are growing over time
objgraph.show_growth(limit=10)
# ... do some work ...
objgraph.show_growth(limit=10) # Shows what increased
# Find reference chains keeping an object alive
# Useful for debugging "why won't this object die?"
objgraph.show_backrefs(
objgraph.by_type('MyExpensiveClass')[:1],
max_depth=5,
filename='refs.png'
)
A Decision Framework for Memory Investigation
The choice of tool depends on what question you're asking. If the question is "where is memory being allocated in my Python code?" -- use tracemalloc with snapshot comparison. If the question is "why is my process using so much RSS?" -- use memray, because the answer often involves C extensions or native libraries that tracemalloc doesn't see. If the question is "why isn't this specific object being freed?" -- use objgraph to trace the reference chain back to the root reference holding it alive. And if the question is "is my server's shared memory degrading after forking?" -- monitor /proc/[pid]/smaps and track the Shared_Clean and Private_Dirty counters over time.
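For the last case, a small Linux-only helper can read those counters directly. This is an illustrative sketch (the function name cow_stats is my own); /proc/[pid]/smaps_rollup requires Linux 4.14+:

```python
from pathlib import Path

def cow_stats(pid: str = "self") -> dict:
    """Read shared/private page counters for a process (Linux only).

    Rising Private_Dirty in forked workers means copy-on-write is
    steadily unsharing pages; Shared_Clean is memory still shared
    with the parent. Returns {} where smaps_rollup is unavailable.
    """
    path = Path(f"/proc/{pid}/smaps_rollup")
    if not path.exists():
        return {}
    stats = {}
    for line in path.read_text().splitlines():
        if line.startswith(("Shared_Clean:", "Private_Dirty:")):
            field, value, _unit = line.split()
            stats[field.rstrip(":")] = int(value)  # size in kB
    return stats

print(cow_stats())  # e.g. {'Shared_Clean': 2520, 'Private_Dirty': 5328}
```

Sampling this periodically in each worker and exporting it as a metric turns CoW degradation into something you can alert on.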
Key Takeaways
- The three-tier allocator (arenas, pools, blocks) is why Python doesn't call malloc per-object. pymalloc handles everything under 512 bytes internally, which is fast -- but means freed objects don't always reduce OS-level memory usage. On 64-bit systems, a single surviving object in an arena locks 1 MiB; on 32-bit systems, 256 KiB.
- Reference counting is the primary mechanism, not the GC. The vast majority of objects are freed immediately when their last reference goes away. The cyclic GC exists only to handle the cases reference counting can't: reference cycles.
- The GC's traversal writes cause copy-on-write page faults in forked processes. In multi-process servers, gc.freeze() before forking keeps pre-fork memory shared and avoids this overhead entirely.
- Use tracemalloc for Python-level allocation tracking and memray for full-stack profiling. tracemalloc is built into the standard library and excellent for finding Python-level leaks. memray catches everything including C extensions and native libraries.
- __slots__ and generators are the two highest-leverage memory optimizations. Slots eliminate per-instance dictionaries (40-60% savings per instance). Generators avoid materializing large sequences in memory entirely.
- Object interning and free lists create hidden memory relationships. Integer caching (-5 to 256), string interning, and type-specific free lists mean memory behavior often differs from what a simple object-count analysis would predict.
- Free-threading (PEP 703) is reshaping the memory management landscape. The replacement of pymalloc with mimalloc, atomic reference counting, and biased reference counting represent the largest changes to CPython's memory internals in decades. As of Python 3.14, this path is officially supported.
Python's memory management is not a black box you can safely ignore. The three-tier allocator determines how memory is requested and recycled. Reference counting handles the common case with deterministic, immediate deallocation. The generational garbage collector catches the cycles that reference counting misses. Object interning and free lists optimize for the common case but create subtle memory retention patterns that only become visible at scale. And tools like tracemalloc, memray, objgraph, and the gc module give you visibility into all of it.
The deeper insight is that these systems interact in non-obvious ways. The pymalloc allocator's arena structure means freed objects don't always reduce OS-level memory. Reference counting's per-object writes trigger copy-on-write in forked processes -- a problem that took Instagram's engineering team and eventually the CPython core developers themselves (through PEP 683's immortal objects and gc.freeze()) to solve. The GC's linked list traversal pattern destroys cache locality. Type-specific free lists can hold memory hostage from other types. And now, with the free-threaded build replacing pymalloc with mimalloc and introducing atomic reference counting, the entire memory management substrate is being rebuilt for a concurrent future. Each mechanism has trade-offs, and understanding those trade-offs -- and how they compound -- is what separates code that runs from code that runs well under real-world conditions.