Python is powerful and readable, but it is not always fast out of the box. The good news is that with a few targeted changes, you can squeeze significant speed improvements out of your Python code without sacrificing the clarity that makes Python worth using in the first place.
This guide covers 10 practical techniques you can apply right now, from everyday habits like choosing the right data structure, all the way to new capabilities like the free-threaded interpreter in Python 3.13 and 3.14. Each tip includes code examples so you can see the difference for yourself.
1. Profile Before You Optimize
Before changing anything, you need to know where your code is actually slow. Guessing at bottlenecks is unreliable, especially in larger applications with multiple dependencies and layers of abstraction. Python ships with several profiling tools that let you measure exactly where time is being spent.
The timeit module is ideal for quick comparisons between two approaches. It runs a snippet many times and reports the total execution time across those runs, smoothing out noise from system-level fluctuations:
import timeit
# Compare two ways to build a list of squares
time_comp = timeit.timeit(
    '[x**2 for x in range(1000)]',
    number=10000
)
time_map = timeit.timeit(
    'list(map(lambda x: x**2, range(1000)))',
    number=10000
)
print(f"List comprehension: {time_comp:.4f}s")
print(f"map + lambda: {time_map:.4f}s")
For full-program profiling, cProfile shows you exactly which functions consume the most time. Run it from the command line with python -m cProfile -s cumtime your_script.py to sort results by cumulative time, making the biggest offenders immediately visible.
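cProfile can also be driven from inside a program rather than the command line, which is useful when you only want to profile one section of a larger application. A minimal sketch, where build_squares is just a stand-in workload for your own code:

```python
import cProfile
import io
import pstats

def build_squares(n):
    """Placeholder workload to profile."""
    return [x ** 2 for x in range(n)]

profiler = cProfile.Profile()
profiler.enable()
build_squares(100_000)
profiler.disable()

# Sort by cumulative time and print the top 5 entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

Capturing the report in a StringIO makes it easy to log or assert on profiling output in tests instead of printing straight to stdout.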
For long-running or production applications, Py-Spy is a sampling profiler that attaches to a running Python process without modifying your code. It generates flame graphs that provide a visual map of where execution time accumulates.
Always profile first. The bottleneck is rarely where you think it is. A function that runs once for 3 seconds matters more than a function that runs a million times but takes 0.001 seconds total.
2. Use Built-in Functions and Standard Library Modules
Python's built-in functions like sum(), min(), max(), sorted(), and any() are implemented in C under the hood. They operate at a lower level than interpreted Python bytecode, which means they skip the overhead of Python's dynamic typing and per-iteration interpreter work. The difference can be dramatic in tight loops.
import math
numbers = [3.5, 7.2, 1.8, 9.4, 5.1]
# Slow: manual loop to find the sum
total = 0
for n in numbers:
    total += n
# Fast: built-in sum()
total = sum(numbers)
# Slow: manual square root with exponentiation
roots = [n ** 0.5 for n in numbers]
# Fast: math.sqrt() is optimized C
roots = [math.sqrt(n) for n in numbers]
The same principle applies to the standard library more broadly. The itertools module is a collection of highly optimized iterator-building blocks implemented in C. Functions like itertools.chain(), itertools.islice(), and itertools.groupby() can replace hand-written loop logic with faster, more memory-efficient alternatives.
from itertools import chain
# Combine multiple lists without creating an intermediate list
lists = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
# Slow: nested loop concatenation
combined = []
for sublist in lists:
    for item in sublist:
        combined.append(item)
# Fast: itertools.chain.from_iterable
combined = list(chain.from_iterable(lists))
If you only have time to study one standard library module for performance, make it itertools. Also consider installing more-itertools for additional utility functions not included in the standard library.
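As one more illustration, itertools.groupby collapses runs of adjacent items that share a key. A detail that trips up many first-time users: it only merges adjacent items, so the input must already be sorted (or at least grouped) by that key. The word list here is invented for the example:

```python
from itertools import groupby

# groupby only merges *adjacent* items, so sort by the key first
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']
by_letter = {
    letter: list(group)
    for letter, group in groupby(words, key=lambda w: w[0])
}
print(by_letter)
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```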
3. Choose the Right Data Structure
The data structure you pick has a direct impact on how fast your code runs. A common mistake is using a list when a set or dictionary would be far more appropriate for the task.
Checking whether an item exists in a list requires scanning through items one by one, which takes O(n) time. Sets and dictionaries use hash tables internally, so the same lookup takes O(1) time on average, regardless of size.
# Membership testing: list vs. set
user_ids_list = list(range(100_000))
user_ids_set = set(range(100_000))
target = 99_999
# Slow: O(n) scan
if target in user_ids_list:
    pass
# Fast: O(1) hash lookup
if target in user_ids_set:
    pass
For ordered collections where you frequently add or remove items at the front, collections.deque outperforms a regular list. Inserting at or popping from the front of a list is O(n) because every remaining element has to shift. A deque handles both ends in O(1).
from collections import deque
# Slow: list.insert(0, item) shifts all elements
queue = []
for i in range(10_000):
    queue.insert(0, i)
# Fast: deque.appendleft() is O(1)
queue = deque()
for i in range(10_000):
    queue.appendleft(i)
Tuples also deserve a mention. Because tuples are immutable, Python can apply internal optimizations like caching small tuples and allocating memory more efficiently. If you have data that will not change after creation, using a tuple instead of a list can provide a small but consistent speed improvement.
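You can check the size difference yourself with sys.getsizeof, which reports the container's own footprint (not that of the elements it references); the exact byte counts vary by CPython version:

```python
import sys

coords_list = [3.5, 7.2, 1.8]
coords_tuple = (3.5, 7.2, 1.8)

# The tuple's fixed-size layout is smaller than the list's
# resizable buffer (exact numbers vary by Python version).
print(f"list:  {sys.getsizeof(coords_list)} bytes")
print(f"tuple: {sys.getsizeof(coords_tuple)} bytes")
```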
4. Cache Expensive Function Calls with lru_cache
When a function is called repeatedly with the same arguments, you are doing the same computation over and over. The functools.lru_cache decorator stores previously computed results and returns them instantly on subsequent calls with the same inputs. This is particularly powerful for recursive functions.
from functools import lru_cache
# Without caching: exponential time complexity
def fibonacci_slow(n):
    if n < 2:
        return n
    return fibonacci_slow(n - 1) + fibonacci_slow(n - 2)
# With caching: linear time complexity
@lru_cache(maxsize=256)
def fibonacci_fast(n):
    if n < 2:
        return n
    return fibonacci_fast(n - 1) + fibonacci_fast(n - 2)
# fibonacci_slow(35) takes several seconds
# fibonacci_fast(35) returns almost instantly
The maxsize parameter controls how many results are stored. Once the cache is full, the least recently used entry gets evicted. For functions where you want unlimited caching, pass maxsize=None. Keep in mind that the function arguments need to be hashable (strings, numbers, tuples) since they are used as dictionary keys internally.
Starting in Python 3.9, you can also use @functools.cache as a simpler shortcut for @lru_cache(maxsize=None) when you want to cache every unique call without a size limit.
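Both decorators also expose a cache_info() method, which is handy for confirming the cache is actually being hit. A quick check using @functools.cache on the same Fibonacci function:

```python
from functools import cache

@cache
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(35))  # 9227465, returns almost instantly
print(fibonacci.cache_info())
# Each of the 36 distinct inputs (0..35) is computed exactly once
# (a "miss"); every other recursive call is a cache hit.
```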
5. Use Generators Instead of Lists for Large Data
When you process a large sequence of items, building the entire list in memory first is wasteful. Generators produce items one at a time, on demand, keeping memory usage constant regardless of how much data you are working with.
# Memory-heavy: creates the full list in memory
squares_list = [x ** 2 for x in range(10_000_000)]
total = sum(squares_list)
# Memory-efficient: generates values one at a time
squares_gen = (x ** 2 for x in range(10_000_000))
total = sum(squares_gen)
The generator expression looks almost identical to a list comprehension, but uses parentheses instead of brackets. Under the hood, the difference is significant: the list version allocates memory for all 10 million integers at once, while the generator version yields each value and discards it after use.
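The memory gap is easy to see with sys.getsizeof: the generator object stays a few hundred bytes no matter how large the range, while the list grows with its contents (a smaller range is used here so the snippet runs quickly):

```python
import sys

squares_list = [x ** 2 for x in range(100_000)]
squares_gen = (x ** 2 for x in range(100_000))

# The list holds all 100,000 results; the generator holds only
# its own state and produces values on demand.
print(f"list:      {sys.getsizeof(squares_list):,} bytes")
print(f"generator: {sys.getsizeof(squares_gen):,} bytes")
```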
For more complex logic, use the yield keyword inside a function to create a generator function:
def read_large_file(filepath):
    """Process a large file line by line without
    loading the entire file into memory."""
    with open(filepath, 'r') as f:
        for line in f:
            yield line.strip()
# Processes one line at a time
for line in read_large_file('server_logs.txt'):
    if 'ERROR' in line:
        print(line)
6. Avoid Repeated Attribute Lookups in Loops
Every time you use the dot operator to access a method or attribute on an object, Python performs a hash table lookup. In a tight loop that runs thousands or millions of times, those lookups add up. You can eliminate this overhead by assigning the method to a local variable before the loop starts.
import math
values = range(1_000_000)
# Slow: attribute lookup on every iteration
results = []
for v in values:
    results.append(math.sqrt(v))
# Fast: local reference eliminates repeated lookups
results = []
sqrt = math.sqrt
append = results.append
for v in values:
    append(sqrt(v))
This technique is used extensively throughout the Python standard library itself. Local variable access is faster than global or attribute access because CPython uses indexed array lookups for locals, while globals and attributes require dictionary lookups.
This optimization matters in hot loops that run millions of iterations. For code that only runs a handful of times, clarity should take priority over this kind of micro-optimization.
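You can measure the effect with timeit from tip 1. The speedup varies by CPython version (newer interpreters cache some attribute lookups), so treat the numbers as a rough guide:

```python
import math
import timeit

def with_lookups():
    results = []
    for v in range(100_000):
        results.append(math.sqrt(v))
    return results

def with_local_refs():
    results = []
    sqrt = math.sqrt          # hoisted out of the loop
    append = results.append
    for v in range(100_000):
        append(sqrt(v))
    return results

# Both versions produce identical results; only the lookup cost differs.
t_lookup = timeit.timeit(with_lookups, number=20)
t_local = timeit.timeit(with_local_refs, number=20)
print(f"attribute lookups: {t_lookup:.3f}s  local refs: {t_local:.3f}s")
```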
7. Prefer List Comprehensions and Map Over Manual Loops
List comprehensions are not just syntactic sugar. CPython implements them with optimized bytecode that avoids the overhead of calling .append() on every iteration. The result is typically 20-30% faster than an equivalent for loop that builds a list manually.
data = range(100_000)
# Slower: explicit loop with append
result = []
for x in data:
    if x % 2 == 0:
        result.append(x ** 2)
# Faster: list comprehension
result = [x ** 2 for x in data if x % 2 == 0]
The map() function can be even faster than a list comprehension when applying a single function to every element, because it avoids the overhead of evaluating a Python expression on each iteration. This advantage is strongest when the mapped function is a C-implemented built-in:
# Fast: map with a built-in function
names = ['alice', 'bob', 'charlie', 'diana']
upper_names = list(map(str.upper, names))
When you need to apply a lambda or a more complex expression, a list comprehension is usually both faster and more readable than map().
8. Use __slots__ to Reduce Memory and Speed Up Object Creation
By default, Python objects store their attributes in a dictionary (__dict__), which is flexible but comes with memory overhead. If you know in advance exactly which attributes your class will have, defining __slots__ tells Python to use a more compact, fixed-size internal structure instead.
# Standard class: each instance carries a __dict__
class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y
# Slots class: lighter and faster
class PointSlots:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y
The savings are small per object, but when you create millions of instances, the difference in both memory and creation speed becomes substantial. In benchmarks, __slots__ classes typically use 30-40% less memory per instance and create objects measurably faster.
The trade-off is reduced flexibility: you cannot dynamically add new attributes to a __slots__ class, and you lose the ability to use features that depend on __dict__. Use this technique for data-heavy classes where you are creating a large number of instances with a fixed set of attributes.
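A quick sketch of that trade-off in practice: instances of a __slots__ class carry no per-instance __dict__, and assigning an attribute not listed in __slots__ raises AttributeError. The class below repeats the PointSlots definition so the snippet is self-contained:

```python
class PointSlots:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = PointSlots(1.0, 2.0)

# No per-instance __dict__ exists...
print(hasattr(p, '__dict__'))   # False
# ...so attributes outside __slots__ cannot be added dynamically
try:
    p.z = 3.0
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```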
9. Offload Heavy Work to Optimized Libraries
For numerically intensive work, the fastest Python code is often code that does not run in Python at all. Libraries like NumPy, Polars, and Numba execute their core operations in compiled C, C++, or LLVM-generated machine code, while you write the orchestration logic in Python.
import numpy as np
# Slow: pure Python loop for element-wise addition
a = list(range(1_000_000))
b = list(range(1_000_000))
result = [x + y for x, y in zip(a, b)]
# Fast: NumPy vectorized operation (runs in C)
a = np.arange(1_000_000)
b = np.arange(1_000_000)
result = a + b
NumPy's vectorized operations avoid the per-element Python interpreter overhead entirely. The addition happens in a single C-level loop over contiguous memory, which can be orders of magnitude faster than the pure Python equivalent.
For JIT compilation of custom numerical functions, Numba compiles decorated Python functions into machine code at runtime:
from numba import jit
@jit(nopython=True)
def compute_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total
Numba's nopython=True mode compiles the function entirely to machine code, bypassing the Python interpreter. The first call incurs a compilation delay, but subsequent calls run at near-C speed.
If you are processing tabular data, consider Polars as an alternative to Pandas. Polars is written in Rust, supports lazy evaluation, and can be significantly faster for large data operations.
10. Take Advantage of Free-Threaded Python
Python 3.13 introduced an experimental free-threaded build that allows the Global Interpreter Lock (GIL) to be disabled. The GIL has traditionally prevented Python threads from running in parallel on multiple CPU cores, meaning that multi-threaded Python programs for CPU-bound work could not actually use more than one core at a time.
With the free-threaded build, threads can run truly in parallel. In Python 3.14, free-threaded mode is no longer considered experimental, and the single-threaded performance penalty has been reduced to roughly 5-10% depending on the platform. For CPU-heavy multi-threaded workloads, benchmarks show speedups of around 3x on four threads compared to the standard GIL-enabled interpreter.
import sys
from concurrent.futures import ThreadPoolExecutor
def cpu_intensive_task(n):
    """A CPU-bound calculation."""
    total = 0
    for i in range(n):
        total += i * i
    return total
# Check if the GIL is disabled
try:
    gil_enabled = sys._is_gil_enabled()
except AttributeError:
    gil_enabled = True
print(f"GIL enabled: {gil_enabled}")
# Run tasks in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(cpu_intensive_task, 10_000_000)
        for _ in range(4)
    ]
    results = [f.result() for f in futures]
On the standard GIL-enabled interpreter, the four threads above would effectively run one at a time. On the free-threaded build (installed via python3.14t), they run simultaneously across four cores and finish the workload several times faster.
Free-threaded Python is best suited for CPU-bound parallel workloads. For single-threaded code or I/O-bound tasks, the standard interpreter remains the better choice. Also check that your dependencies support the free-threaded build before adopting it in production, as some C extensions may re-enable the GIL automatically.
Key Takeaways
- Measure first: Use timeit, cProfile, or Py-Spy to find actual bottlenecks before optimizing anything.
- Use C-backed built-ins: Functions like sum(), sorted(), and math.sqrt() are implemented in C and run significantly faster than hand-written Python equivalents.
- Pick the right data structure: Sets and dictionaries provide O(1) lookups; deques handle double-ended operations in O(1); tuples are lighter than lists for fixed data.
- Cache repeated computations: The @lru_cache decorator can turn exponential-time recursive functions into linear-time operations.
- Use generators for large data: Generator expressions and yield keep memory usage constant regardless of data volume.
- Minimize attribute lookups in loops: Assign frequently accessed methods to local variables before entering a hot loop.
- Prefer comprehensions: List comprehensions use optimized bytecode and are typically 20-30% faster than equivalent manual loops.
- Apply __slots__ for data-heavy classes: When creating millions of objects with a fixed set of attributes, __slots__ reduces memory and speeds up creation.
- Offload to compiled libraries: NumPy, Numba, and Polars execute core operations in compiled code, providing order-of-magnitude speedups for numerical work.
- Explore free-threaded Python: Python 3.14's free-threaded build enables true multi-core parallelism for CPU-bound work, with up to 3x speedups on four threads.
Speed optimization in Python is about making informed choices, not rewriting everything in C. Start with profiling, apply the techniques above where they matter, and you will get a faster program that is still readable, maintainable Python.