If you are writing Python code that talks to an API -- any API -- you will eventually see a 429 response. It is the server's way of telling your client to slow down. Retrying immediately makes the problem worse. Exponential backoff is the standard solution, and Python gives you several clean ways to implement it.
This article starts with the fundamentals of what a 429 response contains and how to read the Retry-After header. It then builds a retry function from scratch using the standard library, explains why adding jitter is critical for distributed systems, and examines how retry amplification can turn a single slow service into a system-wide outage across multi-tier architectures. It covers when retrying is not safe due to idempotency concerns, walks through the built-in retry support in urllib3 and HTTPAdapter, and shows how the tenacity library handles complex retry patterns with a single decorator -- including both synchronous requests and asynchronous httpx examples. It also addresses connection errors and timeouts, explains what to do when all retries are exhausted, introduces the circuit breaker as the pattern that decides when to stop retrying entirely, and maps how all of these resilience patterns fit together as a connected system. It closes with a status code retry decision table, a production deployment checklist, and a curated list of primary sources so you can verify every claim.
What HTTP 429 Means and Why It Happens
HTTP 429 is defined in RFC 6585, authored by Mark Nottingham and Roy Fielding and published as an IETF Standards Track document in April 2012. The RFC defines 429 as the status code for "too many requests in a given amount of time." Unlike a 503 (Service Unavailable), which indicates the server itself is struggling, a 429 is a deliberate, targeted response: the server is healthy, but it has decided that your specific client has exceeded its allowed rate.
APIs enforce rate limits for several reasons. They protect back-end infrastructure from being overwhelmed by a single consumer. They ensure fair access across all clients. And they guard against abuse -- whether from a bot, a misconfigured script, or a deliberate attack. When you hit a 429, the API is not broken. It is working exactly as designed.
A 429 response may include several useful headers that tell you about your current rate limit status. The RFC notes that the server may include a Retry-After header specifying how long the client should wait before retrying. Beyond Retry-After, many APIs also send non-standard headers like X-RateLimit-Limit (your total allowance), X-RateLimit-Remaining (how many requests you have left), and X-RateLimit-Reset (when your limit resets). Not all APIs include all of these, but Retry-After is the one that matters for retry logic.
Many API providers use a token bucket algorithm internally, meaning your rate limit replenishes continuously rather than resetting all at once. If your limit is 60 RPM, you effectively get 1 request per second rather than 60 requests at the start of each minute. This means short bursts can trigger a 429 even if your average rate is well within the limit.
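The token bucket model is easy to see in a few lines of client-side code. This is an illustrative sketch of the algorithm, not any provider's actual implementation; the class and parameter names are ours:

```python
import time

class TokenBucket:
    """Illustrative token bucket: a fixed capacity of tokens,
    refilled continuously at refill_rate tokens per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        """Take one token if available; False is the moment a server
        would answer 429."""
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A 60 RPM limit modeled as 1 token/second with a burst capacity of 10
bucket = TokenBucket(capacity=10, refill_rate=1.0)
results = [bucket.try_acquire() for _ in range(15)]
# The first 10 burst requests drain the bucket; the next 5 are rejected
```

Note how a burst of 15 requests fails after 10 even though 15 requests in one minute is far below the 60 RPM average.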
Reading the Retry-After Header
When a Retry-After header is present, always use it. The server is telling you exactly how long to wait, and guessing a shorter time will only get you another 429. As defined in RFC 7231 Section 7.1.3, the header value "can be either an HTTP-date or a number of seconds." Your code needs to handle both formats:
import datetime
import requests
from email.utils import parsedate_to_datetime

def get_retry_after(response: requests.Response) -> float:
    """
    Parse the Retry-After header from a 429 response.
    Returns wait time in seconds. Falls back to 0 if
    the header is missing or unparseable.
    """
    retry_after = response.headers.get("Retry-After")
    if retry_after is None:
        return 0
    # Try parsing as integer seconds first
    try:
        return float(retry_after)
    except ValueError:
        pass
    # Try parsing as HTTP-date
    try:
        retry_date = parsedate_to_datetime(retry_after)
        now = datetime.datetime.now(datetime.timezone.utc)
        wait = (retry_date - now).total_seconds()
        return max(0, wait)
    except (ValueError, TypeError):
        return 0
The dual-format parsing matters because different APIs use different formats. GitHub's API returns an integer in seconds. Other APIs might return a date string like Mon, 16 Mar 2026 15:00:00 GMT. Your retry logic should handle both.
Building Exponential Backoff from Scratch
When the Retry-After header is absent -- and it often is -- you need a strategy for deciding how long to wait between retries. Retrying immediately is the worst option. It wastes your remaining rate limit quota on a request that will almost certainly fail again, and it adds load to an already stressed server.
Exponential backoff progressively increases the delay between retries. The formula is simple: delay = base * (2 ** attempt). With a base delay of 1 second, your retries wait 1 second, then 2, then 4, then 8, and so on. Here is a complete implementation:
import time
import requests

def request_with_backoff(
    method: str,
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    **kwargs,
) -> requests.Response:
    """
    Send an HTTP request with automatic retry on 429 errors.
    Uses the Retry-After header when available and falls
    back to exponential backoff when it is not.
    """
    for attempt in range(max_retries + 1):
        response = requests.request(method, url, **kwargs)
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            # Out of retries -- return the 429 response
            return response
        # Prefer Retry-After header if present
        retry_after = get_retry_after(response)
        if retry_after > 0:
            wait = retry_after
        else:
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s ...
            wait = min(base_delay * (2 ** attempt), max_delay)
        print(f"429 received. Retrying in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
The max_delay cap prevents the backoff from growing indefinitely. Without it, attempt number 10 would wait over 17 minutes (1 * 2^10 = 1024 seconds). Capping at 60 seconds keeps the retry behavior reasonable while still giving the server time to recover.
Only retry on 429 and 503 status codes. Do not retry on 400 (bad request), 401 (unauthorized), 403 (forbidden), or 404 (not found). These are client errors that will not resolve themselves no matter how many times you retry. Retrying on them wastes time and rate limit quota.
HTTP Status Codes: Retry Decision Reference
One of the most common mistakes in retry logic is retrying indiscriminately on any error. The table below maps standard HTTP error codes to the correct retry decision. This reference saves you from building retry logic that wastes quota on errors that will never self-resolve:
| Status Code | Meaning | Retry? | Reasoning |
|---|---|---|---|
| 429 | Too Many Requests | Yes | Rate limited -- will resolve after the cooldown window resets. |
| 503 | Service Unavailable | Yes | Server is temporarily overloaded or in maintenance. Often includes Retry-After. |
| 502 | Bad Gateway | Yes (cautiously) | Upstream server issue. May resolve quickly, but cap retries at 2-3 attempts. |
| 500 | Internal Server Error | Maybe | Could be transient. Retry once or twice, then abort. |
| 408 | Request Timeout | Yes | Server timed out waiting for the request. Retry is appropriate. |
| 400 | Bad Request | No | Malformed request. Fix the request payload; retrying is pointless. |
| 401 | Unauthorized | No | Invalid or expired credentials. Refresh the token, then retry once. |
| 403 | Forbidden | No | Insufficient permissions. No amount of retrying will grant access. |
| 404 | Not Found | No | Resource does not exist. Retrying wastes quota. |
| 422 | Unprocessable Entity | No | Validation error. Fix the request data before retrying. |
Why Jitter Prevents the Thundering Herd
Plain exponential backoff has a weakness in distributed systems. Imagine 100 clients all hit a rate limit at the same moment. All 100 calculate the same backoff delay -- say, 2 seconds. Two seconds later, all 100 retry simultaneously, and the server immediately returns 429 to all of them again. This synchronized retry pattern is called the thundering herd problem.
The solution is jitter: adding a random component to the delay so that retries are spread out over time. There are three common jitter strategies:
| Strategy | Formula | Behavior |
|---|---|---|
| Full Jitter | random(0, base * 2^attempt) | Wait a random time between 0 and the calculated backoff. Provides the widest spread. |
| Equal Jitter | (base * 2^attempt) / 2 + random(0, (base * 2^attempt) / 2) | Wait at least half the backoff, plus a random amount. Avoids very short waits. |
| Decorrelated Jitter | min(max_delay, random(base, prev_delay * 3)) | Each delay is randomized based on the previous delay rather than the attempt number. |
Full jitter is the recommended default for API clients. Marc Brooker, Senior Principal Engineer at AWS, demonstrated in the AWS Architecture Blog that full jitter reduces client work by more than half compared to un-jittered exponential backoff when tested with 100 contending clients. As the Amazon Builders' Library explains, jitter introduces randomness into the backoff delay to spread retries across time, preventing clients from retrying in synchronized bursts. Here is the backoff function updated with full jitter:
import random

def backoff_with_jitter(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """
    Calculate exponential backoff with full jitter.
    Returns a random delay between 0 and the exponential
    backoff ceiling, capped at max_delay.
    """
    ceiling = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, ceiling)
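The decorrelated jitter strategy from the table is just as short to implement. This sketch (the function name is ours) follows the formula above, feeding each delay back in as the seed for the next:

```python
import random

def decorrelated_jitter(
    prev_delay: float,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """
    Decorrelated jitter: draw each delay from a range based on the
    previous delay rather than the attempt number, capped at max_delay.
    """
    return min(max_delay, random.uniform(base_delay, prev_delay * 3))

# Each retry's delay seeds the next one
delay = 1.0
delays = []
for _ in range(5):
    delay = decorrelated_jitter(delay)
    delays.append(delay)
```

Because each delay depends on the previous one rather than the attempt counter, two clients that fail at the same moment diverge quickly.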
When a Retry-After header is present, you can still add a small amount of jitter on top of it. The header tells you the minimum time to wait, but adding 1-2 seconds of randomness helps prevent synchronized retries across clients that all received the same Retry-After value at the same time.
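Treating the header value as a floor, the combination is a one-liner; the 2-second jitter window here is an arbitrary choice, not a standard:

```python
import random

def wait_after_429(retry_after: float, jitter_seconds: float = 2.0) -> float:
    """
    Honor the server's Retry-After as a minimum wait, then add a
    small random offset so clients that all received the same value
    do not retry in the same instant.
    """
    return retry_after + random.uniform(0, jitter_seconds)
```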
Retry Amplification: When Backoff Alone Is Not Enough
Exponential backoff with jitter solves the thundering herd problem at a single service boundary. But in a multi-tier architecture -- where Service A calls Service B, which calls Service C, which calls Service D -- retry logic at every layer creates a compounding effect that backoff alone cannot prevent. This is retry amplification, and it is one of the ways a localized rate limit can escalate into a system-wide outage.
Consider a chain of four services, each configured with a modest retry count of two (the original attempt plus one retry). If Service D starts returning 429s, Service C retries each failed request once, doubling the traffic to D. Service B sees those failures propagate upward and retries its calls to C, which each trigger two calls to D. Service A does the same. The math is exponential: with K services in the chain and each allowing one retry, the bottom service receives up to 2^(K-1) times the original request volume. For four services, that is 8x. For ten services -- not unusual in a large microservices deployment -- it is 512x. What started as 100 requests per second to the bottom service becomes 51,200.
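The worst-case arithmetic generalizes to a one-liner; the function name here is ours, for illustration:

```python
def amplification_factor(depth: int, retries_per_hop: int = 1) -> int:
    """
    Worst-case multiplier on traffic reaching the bottom of a call
    chain of `depth` services, where every hop makes
    (retries_per_hop + 1) attempts per incoming request.
    """
    return (retries_per_hop + 1) ** (depth - 1)

assert amplification_factor(4) == 8      # four services: 8x
assert amplification_factor(10) == 512   # ten services: 512x
```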
This is the core insight that separates production-grade retry logic from tutorial-grade retry logic: your retry policy does not exist in isolation. It interacts with the retry policies of every other service in the call chain. Adding jitter randomizes the timing, but it does not reduce the total volume of retry traffic flowing through the system. Three patterns address this gap:
Retry Budgets
A retry budget caps the total number of retries a service is allowed to generate as a percentage of its successful request volume. Instead of allowing every individual request to retry up to N times, the service tracks a rolling ratio: if more than, say, 10% of recent requests are retries, no further retries are permitted until the ratio drops. This prevents any single service from flooding its downstream dependencies during a sustained failure. Service meshes like Istio implement retry budgets at the proxy layer through Envoy's retry_budget configuration, making it possible to enforce this limit without modifying application code.
import time
from collections import deque
from threading import Lock

class RetryBudget:
    """
    Enforce a retry budget: allow retries only when the
    retry ratio stays below a configurable threshold.
    Tracks requests over a rolling time window and limits
    retries to a percentage of total traffic.
    """

    def __init__(
        self,
        budget_ratio: float = 0.10,
        window_seconds: float = 10.0,
    ):
        self.budget_ratio = budget_ratio
        self.window = window_seconds
        self._requests: deque = deque()
        self._retries: deque = deque()
        self._lock = Lock()

    def _prune(self, q: deque) -> None:
        cutoff = time.monotonic() - self.window
        while q and q[0] < cutoff:
            q.popleft()

    def record_request(self) -> None:
        with self._lock:
            self._requests.append(time.monotonic())

    def can_retry(self) -> bool:
        """
        Returns True if the current retry ratio is below
        the budget threshold, False otherwise.
        """
        with self._lock:
            now = time.monotonic()
            self._prune(self._requests)
            self._prune(self._retries)
            total = len(self._requests)
            if total == 0:
                return True
            retry_count = len(self._retries)
            if retry_count / total < self.budget_ratio:
                self._retries.append(now)
                return True
            return False
The retry budget flips the mental model. Instead of asking "should this individual request retry?" it asks "is this service generating too many retries relative to its total traffic?" If 90% of your requests are succeeding, a few retries are fine. If you are already in a degraded state where failures are widespread, adding more retry traffic only deepens the hole.
Hedged Requests
An alternative to retrying after failure is hedging: sending a second copy of the request before the first one has failed, then taking whichever response comes back first. This is useful for latency-sensitive read operations where the tail latency (p99) is much worse than the median. Google's infrastructure uses this pattern extensively, as described in their paper "The Tail at Scale." The trade-off is that hedging doubles the baseline load, so it only makes sense for idempotent reads where the cost of an extra request is low compared to the cost of waiting.
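A minimal thread-based sketch of the hedging idea follows; the 200 ms default hedge delay is an arbitrary choice, and `fetch` stands in for any idempotent read:

```python
import concurrent.futures

def hedged_request(fetch, url: str, hedge_after: float = 0.2):
    """
    Fire the request; if it has not completed within `hedge_after`
    seconds, fire a duplicate and return whichever finishes first.
    Only safe when `fetch` is idempotent.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(fetch, url)
        done, _ = concurrent.futures.wait({first}, timeout=hedge_after)
        if done:
            return first.result()
        # Hedge: send a second copy instead of waiting on the straggler
        second = pool.submit(fetch, url)
        done, _ = concurrent.futures.wait(
            {first, second},
            return_when=concurrent.futures.FIRST_COMPLETED,
        )
        return done.pop().result()
```

One caveat: the executor's context manager still waits for the slower copy to finish before the function returns. A long-lived shared pool avoids that in real code; this sketch keeps the lifecycle simple.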
Thinking in Layers
The broader lesson is that retry logic is not a single-service decision. It is a system-level concern. When you configure retries at the application layer, you need to account for what happens at the infrastructure layer too. Load balancers retry. Service mesh proxies retry. SDK clients retry. If all of these layers retry independently with no coordination, a transient failure at the bottom of the stack gets amplified into a sustained outage at the top. The most resilient systems limit retries at the layer closest to the failure, use retry budgets to cap amplification, and combine retries with circuit breakers (covered in a later section) to stop retrying entirely when a dependency is clearly down.
Idempotency: When Retrying Is Not Safe
Every code example shown so far assumes one thing: that sending the same request a second time will not cause unintended side effects. For GET requests, that assumption holds. A GET is inherently idempotent -- no matter how many times you send it, the server state does not change. But a POST that creates a new order, a PATCH that increments a counter, or a DELETE without conditional headers may not be safe to retry blindly.
Consider what happens when a POST request to a payment API gets a 429 response. Your retry logic waits, then sends the same POST again. But what if the server received and processed the first request before returning the 429? You now have two charges instead of one. The 429 tells you the server is rate-limiting -- it does not guarantee the original request was never processed.
The safest approach is to use idempotency keys. An idempotency key is a unique identifier (typically a UUID) that you include as a header with each request. If the server receives a second request with the same key, it returns the original response instead of processing the request again. Stripe, PayPal, and many payment APIs support this pattern natively through an Idempotency-Key header:
import time
import uuid
import requests

def post_with_idempotency(
    url: str,
    json_data: dict,
    max_retries: int = 5,
    base_delay: float = 1.0,
    **kwargs,
) -> requests.Response:
    """
    POST with an idempotency key so retries on 429
    never create duplicate resources.
    """
    idempotency_key = str(uuid.uuid4())
    headers = kwargs.pop("headers", {})
    headers["Idempotency-Key"] = idempotency_key
    for attempt in range(max_retries + 1):
        response = requests.post(
            url, json=json_data, headers=headers, **kwargs
        )
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            return response
        retry_after = get_retry_after(response)
        wait = retry_after if retry_after > 0 else min(
            base_delay * (2 ** attempt), 60.0
        )
        time.sleep(wait)
The key is generated once before the retry loop begins, then reused on every attempt. This ensures the server can deduplicate the request regardless of how many times it arrives. If the API you are calling does not support idempotency keys, limit automatic retries to GET, HEAD, PUT, and DELETE -- the methods that HTTP defines as idempotent. For non-idempotent POST and PATCH calls, let the failure propagate to the caller and let application-level logic decide whether it is safe to retry.
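That policy can be encoded in a simple guard; the names here are illustrative:

```python
# Methods HTTP defines as idempotent (RFC 7231, Section 4.2.2)
IDEMPOTENT_METHODS = frozenset(
    {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}
)

def safe_to_retry(method: str, has_idempotency_key: bool = False) -> bool:
    """
    Allow automatic retries only when HTTP semantics, or an explicit
    idempotency key, guarantee the request can be repeated safely.
    """
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key

assert safe_to_retry("get")
assert not safe_to_retry("POST")
assert safe_to_retry("POST", has_idempotency_key=True)
```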
The urllib3 Retry class defaults to retrying only idempotent methods (GET, HEAD, PUT, DELETE, OPTIONS, TRACE) for exactly this reason. If you override allowed_methods to include POST, you are accepting responsibility for ensuring those calls are safe to retry -- either through idempotency keys or because the endpoint is inherently idempotent.
Built-in Retry with urllib3 and HTTPAdapter
Before reaching for an external library, it is worth knowing that the requests library already ships with retry support through its underlying urllib3 layer. The urllib3.util.Retry class provides exponential backoff, status code filtering, and automatic Retry-After header support out of the box. You wire it up through a requests.adapters.HTTPAdapter:
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

def create_retry_session(
    retries: int = 3,
    backoff_factor: float = 1.0,
    status_forcelist: tuple = (429, 500, 502, 503, 504),
) -> requests.Session:
    """
    Create a requests Session with automatic retry on
    transient errors. Uses urllib3's built-in Retry class.
    """
    retry_strategy = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
        # urllib3 reads and respects Retry-After by default
        respect_retry_after_header=True,
        # Only retry idempotent methods by default
        allowed_methods=["GET", "HEAD", "PUT", "DELETE", "OPTIONS"],
        # Add jitter to spread out retries (urllib3 2.x+)
        backoff_jitter=0.5,
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# Usage -- every request through this session retries automatically
session = create_retry_session()
response = session.get("https://api.example.com/data")
The backoff_factor controls the exponential delay: urllib3 performs the first retry without sleeping, then each subsequent retry sleeps for backoff_factor * (2 ** (retry_number - 1)) seconds. With a factor of 1.0, the delays are 0 seconds, then 2, then 4, then 8. The backoff_jitter parameter (added in urllib3 2.0) adds up to that many seconds of randomness to each delay, addressing the thundering herd problem without requiring any extra code. Setting respect_retry_after_header=True (which is the default) means urllib3 will automatically honor the Retry-After header when present, sleeping for the server-specified duration instead of using the calculated backoff.
This approach has a meaningful advantage for straightforward use cases: it requires zero external dependencies beyond requests, which you are already using. It also handles the retry transparently at the transport layer, so your business logic only sees the final response. The trade-off is that urllib3's Retry is less composable than tenacity. You cannot easily add custom callbacks, combine multiple wait strategies, or use it with async code. For simple API clients where you just need reliable GET requests with backoff, the built-in approach is often all you need. For more complex patterns -- custom exception handling, Retry-After-aware wait functions, async support, or per-function retry policies -- tenacity gives you finer control.
Using Tenacity for Production Retry Logic
Building retry logic from scratch is a useful exercise, but for production code the tenacity library (currently at version 9.1.4, Apache 2.0 licensed) handles the complexity more reliably. Originally forked from the now-unmaintained retrying library, tenacity provides decorators for exponential backoff, jitter, retry-on-exception filtering, stop conditions, and logging -- all composable with a clean API.
Install it with pip install tenacity, then apply it to any function that makes API calls:
import logging

import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)

class RateLimitError(Exception):
    """Raised when an API returns HTTP 429."""

    def __init__(self, response: requests.Response):
        self.response = response
        self.retry_after = response.headers.get("Retry-After")
        super().__init__(f"429 Too Many Requests (Retry-After: {self.retry_after})")

def raise_for_rate_limit(response: requests.Response) -> None:
    """Raise RateLimitError on 429 responses."""
    if response.status_code == 429:
        raise RateLimitError(response)

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def fetch_data(url: str, **kwargs) -> dict:
    """
    Fetch data from an API with automatic retry on rate limits.
    Uses tenacity's random exponential backoff (full jitter)
    and stops after 6 attempts.
    """
    response = requests.get(url, **kwargs)
    raise_for_rate_limit(response)
    response.raise_for_status()
    return response.json()
The wait_random_exponential strategy implements the "Full Jitter" algorithm described in the AWS Architecture Blog. The multiplier=1 means the first retry waits between 0 and 1 second, the second between 0 and 2, the third between 0 and 4, and so on up to max=60. The before_sleep_log callback logs each retry attempt at WARNING level, which is essential for monitoring in production.
The custom RateLimitError exception lets tenacity distinguish between rate limit errors (which should be retried) and other HTTP errors (which should not). A 400 or 401 will propagate immediately without wasting retries.
Respecting Retry-After with tenacity
You can combine tenacity's backoff with the Retry-After header by writing a custom wait function:
from tenacity import wait_base

class wait_from_header(wait_base):
    """
    Custom tenacity wait that uses the Retry-After header
    when available, falling back to exponential backoff.
    """

    def __init__(self, fallback_multiplier=1, fallback_max=60):
        self.fallback = wait_random_exponential(
            multiplier=fallback_multiplier, max=fallback_max
        )

    def __call__(self, retry_state):
        exc = retry_state.outcome.exception()
        if isinstance(exc, RateLimitError) and exc.retry_after:
            try:
                return float(exc.retry_after)
            except (ValueError, TypeError):
                pass
        return self.fallback(retry_state)

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_from_header(fallback_multiplier=1, fallback_max=60),
    stop=stop_after_attempt(6),
)
def fetch_data_smart(url: str, **kwargs) -> dict:
    """Fetch with Retry-After awareness."""
    response = requests.get(url, **kwargs)
    raise_for_rate_limit(response)
    response.raise_for_status()
    return response.json()
Async Retry with httpx and tenacity
If your application uses asyncio and httpx for non-blocking HTTP requests, tenacity works identically with async functions. The decorator detects the coroutine automatically and uses asyncio.sleep instead of time.sleep:
import httpx
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)

class AsyncRateLimitError(Exception):
    """Raised on 429 from async HTTP calls."""

    def __init__(self, response: httpx.Response):
        self.response = response
        self.retry_after = response.headers.get("Retry-After")
        super().__init__("429 Too Many Requests")

@retry(
    retry=retry_if_exception_type(AsyncRateLimitError),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
)
async def fetch_async(client: httpx.AsyncClient, url: str) -> dict:
    """
    Async fetch with automatic retry on rate limits.
    Tenacity detects the coroutine and awaits sleep
    instead of blocking the event loop.
    """
    response = await client.get(url)
    if response.status_code == 429:
        raise AsyncRateLimitError(response)
    response.raise_for_status()
    return response.json()

# Usage
async def main():
    async with httpx.AsyncClient() as client:
        data = await fetch_async(client, "https://api.example.com/data")
        print(data)
When making many concurrent async requests, combine retry logic with a semaphore or rate limiter like asyncio.Semaphore to proactively limit your request rate. It is far better to throttle yourself below the rate limit than to repeatedly hit it and rely on backoff to recover.
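The throttling pattern is independent of the HTTP client. Here is a sketch using a stub coroutine in place of the real call; the concurrency limit of 5 is an arbitrary choice for illustration:

```python
import asyncio

async def limited_fetch(sem: asyncio.Semaphore, url: str) -> str:
    """Hold the semaphore for the duration of each request so at
    most N requests are ever in flight at once."""
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for `await client.get(url)`
        return f"fetched {url}"

async def fetch_all(urls: list, limit: int = 5) -> list:
    # All tasks share one semaphore; gather preserves input order
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(limited_fetch(sem, u) for u in urls))

urls = [f"https://api.example.com/item/{i}" for i in range(20)]
results = asyncio.run(fetch_all(urls))
```

Swapping the stub sleep for an `httpx.AsyncClient` call (and keeping the retry decorator on that call) gives you both proactive throttling and reactive backoff.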
Handling Connection Errors and Timeouts
So far, every example has focused on HTTP 429 status codes -- cases where the server successfully responds with a rate limit signal. But in production, many failures never produce a status code at all. The TCP connection might be refused. The DNS lookup might fail. The server might accept the connection but never send a response, causing a read timeout. These transport-layer failures are just as common as 429 responses, and your retry logic needs to handle them too.
The requests library raises specific exceptions for these scenarios: ConnectionError for refused or dropped connections, Timeout for read or connect timeouts, and ChunkedEncodingError for incomplete response bodies. With tenacity, you can combine these with your existing rate limit handling by retrying on multiple exception types:
import logging

import requests
from requests.exceptions import (
    ConnectionError,
    Timeout,
    ChunkedEncodingError,
)
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)

# Combine all retryable exceptions into one condition
RETRYABLE_ERRORS = (
    RateLimitError,        # HTTP 429 (from earlier example)
    ConnectionError,       # Connection refused / dropped
    Timeout,               # Connect or read timeout
    ChunkedEncodingError,  # Incomplete response
)

@retry(
    retry=retry_if_exception_type(RETRYABLE_ERRORS),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def fetch_resilient(url: str, **kwargs) -> dict:
    """
    Fetch data with retry on both HTTP 429 and
    transport-layer failures like timeouts and
    connection errors.
    """
    kwargs.setdefault("timeout", (5, 30))
    response = requests.get(url, **kwargs)
    raise_for_rate_limit(response)
    response.raise_for_status()
    return response.json()
The timeout=(5, 30) tuple sets a 5-second connect timeout and a 30-second read timeout. Without an explicit timeout, requests will wait indefinitely for a response, which means a single stalled connection can block your entire application. Always set a timeout -- it is the difference between a retry loop that recovers in seconds and a process that hangs until someone notices.
Be careful about retrying after a Timeout on non-idempotent requests. A read timeout means the server may have received and processed your request -- it just took too long to send the response back. The idempotency key pattern from the earlier section applies here too: if you are retrying a POST after a timeout, include an idempotency key to prevent duplicate processing.
When All Retries Fail: Graceful Degradation
Retry logic handles transient failures. But what happens when the failures are not transient? When the API is down for extended maintenance, or your account has been suspended, or the rate limit window is so long that your maximum backoff expires before it resets? Your code needs a plan for what to do after the last retry fails.
The worst outcome is an unhandled exception that crashes your application. The second worst is a vague error message that tells the user nothing. A well-designed failure path should do three things: log enough context to diagnose the problem, return a meaningful result to the caller, and avoid cascading the failure to unrelated parts of the system.
from tenacity import RetryError

def get_user_profile(user_id: str) -> dict:
    """
    Fetch a user profile with a fallback strategy
    when all retries are exhausted.
    """
    try:
        return fetch_resilient(
            f"https://api.example.com/users/{user_id}"
        )
    except RetryError as exc:
        logger.error(
            "All retries exhausted for user %s: %s",
            user_id,
            exc.last_attempt.exception(),
        )
        # Strategy 1: Return cached data if available
        # (`cache` is any application-level cache -- a dict,
        # Redis client, etc. -- assumed to exist elsewhere)
        cached = cache.get(f"user:{user_id}")
        if cached:
            logger.info("Serving cached profile for user %s", user_id)
            return cached
        # Strategy 2: Return a degraded response
        return {
            "user_id": user_id,
            "status": "unavailable",
            "message": "Profile temporarily unavailable. Try again later.",
        }
The RetryError that tenacity raises after exhausting all attempts wraps the last exception in exc.last_attempt.exception(), giving you the original error for logging. From there, you have several options depending on your application's requirements. You can serve stale data from a cache, return a degraded response that lets the rest of the application continue, queue the failed request for later processing, or propagate a structured error to the UI layer so the user understands what happened and when to try again.
The right strategy depends on how critical the data is. A social media feed can show cached posts without the user noticing. A payment confirmation cannot serve stale data -- it needs to surface the failure clearly and let the user decide what to do next. The important thing is that your code makes this decision explicitly rather than letting an uncaught exception make it for you.
The Circuit Breaker: Knowing When to Stop Trying
Retry logic assumes that the failure is transient -- that if you wait long enough and try again, the request will eventually succeed. But what happens when the assumption is wrong? When the downstream service is not briefly rate-limiting you but is genuinely down for an extended period, every retry is a wasted connection, a wasted thread, and wasted time. Worse, those retries are adding load to a system that is already struggling to recover.
The circuit breaker pattern, popularized by Michael Nygard in Release It!, addresses this by monitoring the failure rate of outgoing calls and automatically stopping requests when failures exceed a threshold. It operates as a three-state machine: in the Closed state, requests flow through normally while the breaker counts failures; in the Open state, entered once failures cross the threshold, every call fails fast without touching the network; and in the Half-Open state, entered after a cooldown, a single probe request is allowed through -- success closes the circuit again, failure reopens it.
In Python, the pybreaker library provides a clean implementation that integrates naturally with tenacity. The circuit breaker wraps the outgoing call. Tenacity handles per-request retry logic (backoff, jitter, stop conditions). The circuit breaker handles the cross-request concern: has this dependency been failing consistently enough that we should stop calling it entirely?
import pybreaker
import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)

# Circuit breaker: open after 5 consecutive failures,
# stay open for 30 seconds before probing.
api_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    exclude=[
        lambda e: isinstance(e, requests.HTTPError)
        and e.response.status_code < 500
    ],
)

class RateLimitError(Exception):
    def __init__(self, response):
        self.response = response
        super().__init__("429 Too Many Requests")

@retry(
    retry=retry_if_exception_type(
        (RateLimitError, pybreaker.CircuitBreakerError)
    ),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
)
@api_breaker
def fetch_with_breaker(url: str, **kwargs) -> dict:
    """
    Fetch data protected by both tenacity retry logic
    and a pybreaker circuit breaker.

    Tenacity handles transient 429s with backoff.
    The circuit breaker trips after 5 consecutive
    failures and fails fast for 30 seconds, preventing
    your application from piling retries onto a
    dependency that is clearly down.
    """
    response = requests.get(url, timeout=(5, 30), **kwargs)
    if response.status_code == 429:
        raise RateLimitError(response)
    response.raise_for_status()
    return response.json()
The decorator stacking matters here. The @api_breaker decorator wraps the function first (innermost), so it monitors every call including retries. The @retry decorator wraps the breaker (outermost), so it can catch CircuitBreakerError and back off before probing again. When the circuit opens, the breaker raises CircuitBreakerError immediately without making a network call. This fail-fast behavior is what protects your application: instead of waiting for a 30-second timeout on every request to a dead service, the breaker returns in microseconds.
The exclude parameter is important. Client errors like 400 and 404 should not count toward the failure threshold -- they indicate problems with your request, not problems with the service. Only server-side failures (5xx) and transport errors should trip the breaker.
Should a 429 trip the circuit breaker? It depends. A 429 from a rate limit with a short Retry-After window is transient -- the server is healthy, you just need to slow down. But a sustained burst of 429s across all your requests may indicate that your client is systematically exceeding its quota, and continuing to retry will not help. A practical approach is to count 429s toward the breaker threshold only if they lack a Retry-After header or if the header specifies a wait longer than your maximum backoff cap.
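That practical approach can be expressed as a small policy function. This is a sketch with illustrative names, not part of pybreaker's API; you would call it from the breaker's exclude list or from your own failure-counting logic:

```python
MAX_BACKOFF_SECONDS = 60.0  # should match your retry policy's backoff cap


def rate_limit_counts_as_failure(headers: dict, max_backoff: float = MAX_BACKOFF_SECONDS) -> bool:
    """Decide whether a 429 should count toward the circuit breaker threshold.

    A 429 with a short Retry-After is a normal, transient rate limit.
    A 429 with no Retry-After, or one longer than our backoff cap,
    suggests sustained quota exhaustion and should trip the breaker.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is None:
        return True  # no guidance from the server: treat as a failure
    try:
        wait = float(retry_after)
    except ValueError:
        # Retry-After may also be an HTTP-date; be conservative here.
        return True
    return wait > max_backoff
```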
The Resilience Stack: How the Patterns Connect
Every pattern covered in this article addresses a different failure mode. Individually, each one solves a specific problem. Together, they form a layered defense where each pattern covers the gaps that the others leave open. Understanding how they relate to each other -- what each one assumes, and where each one breaks down -- is what separates retry logic that works in a tutorial from retry logic that survives production.
The stack reads from top to bottom as an escalation path. The first layer -- proactive throttling -- is the line of defense that prevents 429 errors from occurring in the first place. If your client respects the API's documented rate limit by throttling outbound requests with a semaphore or token bucket, retries become a safety net rather than a primary recovery mechanism. When a 429 does arrive, the Retry-After header provides the server's preferred wait time. If the header is absent, exponential backoff with jitter calculates a reasonable delay while preventing synchronized retries across clients.
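A minimal token bucket for that first, proactive layer might look like the sketch below. The clock is injectable so the logic can be tested deterministically; the rate and capacity values are assumptions you would take from the API's documented limits:

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False to signal 'slow down'."""
        now = self.clock()
        # Refill proportionally to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A client that checks `try_acquire()` before every outbound request, and waits briefly when it returns False, stays under the server's limit and rarely sees a 429 at all.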
These three layers handle transient, single-request failures. The layers below them handle systemic failures that span multiple requests. A retry budget limits the total volume of retry traffic your service generates, preventing the amplification effect where retries at each tier multiply the load on downstream services. The circuit breaker goes further: when a dependency has failed consistently enough to cross a threshold, it stops all outbound requests to that dependency and fails fast, giving the struggling service room to recover without additional pressure.
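A retry budget can be sketched as a counter over a sliding window. The names are illustrative; the 10% default mirrors common service-mesh retry-budget defaults, and a real implementation would need locking for concurrent use:

```python
import time
from collections import deque


class RetryBudget:
    """Allow retries only while they stay under `ratio` of recent total requests."""

    def __init__(self, ratio: float = 0.1, window_seconds: float = 10.0, clock=time.monotonic):
        self.ratio = ratio
        self.window = window_seconds
        self.clock = clock
        self.requests = deque()  # timestamps of all requests
        self.retries = deque()   # timestamps of retries only

    def _trim(self, now: float) -> None:
        for q in (self.requests, self.retries):
            while q and now - q[0] > self.window:
                q.popleft()

    def record_request(self) -> None:
        self.requests.append(self.clock())

    def can_retry(self) -> bool:
        now = self.clock()
        self._trim(now)
        # Permit the retry only if it keeps retry traffic within budget;
        # the floor of 1 avoids starving a low-traffic service entirely.
        allowed = max(len(self.requests) * self.ratio, 1)
        if len(self.retries) + 1 <= allowed:
            self.retries.append(now)
            return True
        return False
```

The key property is that the cap scales with traffic: under heavy load, retries are bounded to a fixed fraction of requests, so a failing dependency never sees a flood of retry traffic on top of its normal load.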
At the bottom of the stack is graceful degradation -- the plan for what your application does when every other layer has been exhausted. This is where cached data, degraded responses, and structured error messages come into play. The key insight is that each layer in the stack assumes the layer above it has already failed. Backoff assumes the Retry-After header was missing. The retry budget assumes individual backoff was not enough to prevent overload. The circuit breaker assumes the retry budget was not enough to prevent sustained failure. And graceful degradation assumes the circuit breaker has tripped and the dependency is offline. Designing for each layer forces you to think through the full spectrum of failure, from a single rate-limited request all the way to a dependency that is unreachable for minutes or hours.
Production Retry Checklist
Before deploying retry logic to production, walk through this checklist. Each item addresses a failure mode that has caused outages in real distributed systems:
- Set a maximum retry count. Infinite retries can hold connections open indefinitely and exhaust connection pools. A limit of 5-6 attempts with exponential backoff covers transient issues without risking resource exhaustion.
- Log every retry with context. Include the URL, status code, attempt number, and calculated wait time. Without this, debugging rate limit issues in production becomes guesswork. Use tenacity's `before_sleep_log` or a custom `before_sleep` callback.
- Distinguish retryable from non-retryable errors. Use the status code decision table above. Retrying a 401 or 404 wastes both time and rate limit quota.
- Use a circuit breaker for sustained failures. If an API returns 429 or 503 consistently across many requests, stop making requests entirely for a cooldown period rather than burning through retries on every individual call. Libraries like pybreaker complement tenacity well for this pattern.
- Account for retry amplification. If your service sits in a multi-tier call chain, your retries compound with retries at every other layer. Implement a retry budget that caps retry traffic as a percentage of total traffic, and coordinate retry policies with upstream and downstream teams so the system does not generate more retry volume than it can absorb.
- Monitor your retry rate as a metric. A sudden spike in retries is an early warning signal. Export retry counts and durations to your observability stack (Prometheus, Datadog, etc.) and set alerts when retry rates exceed a threshold. Track circuit breaker state transitions alongside retry metrics to get a complete view of dependency health.
- Test your retry logic with fault injection. Use tools like responses or respx to simulate 429 responses with and without `Retry-After` headers in your test suite. Verify that your backoff timing, jitter, and max-retry behavior work as expected under controlled conditions.
- Respect the API's documented rate limits proactively. The best retry is the one that never has to happen. If an API allows 100 requests per minute, throttle your client to 80 RPM using a token bucket or semaphore, and reserve the retry path for genuine transient failures.
- Audit every layer that retries. Your application code retries. Your HTTP client library might retry. Your load balancer might retry. Your service mesh proxy might retry. Map every retry layer in your stack and ensure they do not multiply each other's retry counts. The total retry fan-out across all layers should be bounded, not accidental.
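For the logging item in the checklist above, a custom callback is just a plain function. The sketch below assumes the retry state object that tenacity passes to `before_sleep` hooks, which exposes `attempt_number`, the `outcome` future, and the planned sleep via `next_action`:

```python
import logging

logger = logging.getLogger("retries")


def log_retry(retry_state) -> None:
    """before_sleep-style callback: log attempt number, error, and wait time."""
    exc = retry_state.outcome.exception() if retry_state.outcome else None
    wait = retry_state.next_action.sleep if retry_state.next_action else None
    logger.warning(
        "retry attempt %d failed with %r; sleeping %.2fs",
        retry_state.attempt_number, exc, wait or 0.0,
    )


# Wire it up with: @retry(..., before_sleep=log_retry)
```

In production you would also include the URL and status code, typically by closing over them or attaching them to the exception, as the `RateLimitError` class earlier in this article does with the response object.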
Key Takeaways
- Always check the Retry-After header first: When a 429 response includes `Retry-After`, the server is telling you the optimal wait time. Use it. Only fall back to calculated backoff when the header is missing.
- Exponential backoff prevents request storms: Doubling the wait time between retries (1s, 2s, 4s, 8s) gives the server progressively more time to recover. Capping the maximum delay at 30-60 seconds keeps retry behavior practical.
- Jitter is not optional in distributed systems: Without random jitter, multiple clients that hit a rate limit at the same moment will all retry at the same moment. Full jitter -- randomizing the entire delay range -- provides the widest spread and resolves contention fastest.
- Retry amplification is the hidden multiplier: In a multi-tier architecture, retries at every layer compound exponentially. With K services each allowing one retry, the bottom service can receive 2^(K-1) times the original load. Retry budgets and cross-layer coordination are essential to keep the total retry volume bounded.
- Check idempotency before retrying writes: GET and PUT are safe to retry. POST and PATCH are not -- unless the API supports idempotency keys. Retrying a non-idempotent request risks creating duplicates, double-charging, or corrupting state.
- Use the built-in urllib3 Retry for simple cases: The `requests` library already supports retry through `HTTPAdapter` and `urllib3.util.Retry`. For straightforward GET-based API clients, this requires zero external dependencies.
- Only retry on retryable errors: Retry on 429 (rate limited) and 503 (temporarily unavailable). Do not retry on 400, 401, 403, or 404 -- these errors will not resolve themselves and retrying wastes quota.
- Handle transport failures, not just status codes: Connection errors, timeouts, and incomplete responses are just as common as 429s in production. Always set explicit timeouts and include transport-layer exceptions in your retry conditions.
- Use circuit breakers to stop retrying when a dependency is down: Retry logic assumes transient failure. A circuit breaker detects persistent failure and fails fast, protecting both your application and the struggling dependency from additional load. Combine `pybreaker` with `tenacity` for layered protection.
- Plan for total failure: When all retries are exhausted, your code needs a deliberate fallback: cached data, degraded responses, or structured errors for the caller. An unhandled `RetryError` is a crash waiting to happen.
- Think in layers, not in isolation: Proactive throttling, Retry-After headers, backoff with jitter, retry budgets, circuit breakers, and graceful degradation form a connected resilience stack. Each layer handles what the layer above it cannot. Designing for the full stack forces you to think through the complete spectrum of failure.
- Use tenacity for production code: The `tenacity` library provides composable decorators for backoff, jitter, stop conditions, and logging. It works with both sync and async functions and keeps retry logic out of your business logic.
A 429 response is not a failure -- it is a signal. The API is telling your client exactly what to do: slow down and try again. But building reliable retry logic means thinking beyond the happy path. It means checking whether a request is safe to retry before you retry it. It means handling the connection errors and timeouts that never produce a status code at all. It means understanding that your retry policy does not exist in isolation -- it interacts with every other retry policy in the call chain, and without coordination, the compounding effect can turn a minor rate limit into a cascading outage. It means knowing when to stop retrying entirely and let a circuit breaker fail fast so a struggling dependency can recover. And it means planning for what happens when every layer of defense has been exhausted. Exponential backoff with jitter is the proven foundation, but it is only one layer of a resilience stack that spans proactive throttling, retry budgets, circuit breakers, and graceful degradation. Python's ecosystem -- from urllib3's built-in Retry to tenacity's composable decorators to pybreaker's circuit breaker implementation -- gives you the tools to build every layer of that stack.
Sources and Further Reading
- M. Nottingham, R. Fielding. RFC 6585 -- Additional HTTP Status Codes. IETF Standards Track, April 2012. Defines HTTP 429 Too Many Requests.
- R. Fielding, J. Reschke. RFC 7231 -- Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content, Section 7.1.3. IETF, June 2014. Defines the Retry-After header format (integer seconds or HTTP-date).
- Marc Brooker. "Exponential Backoff and Jitter". AWS Architecture Blog, 2015 (updated May 2023). Original analysis comparing Full Jitter, Equal Jitter, and Decorrelated Jitter strategies with simulation data.
- Marc Brooker. "Timeouts, retries, and backoff with jitter". Amazon Builders' Library. Production guidance on retry patterns used internally at Amazon.
- Julien Danjou. Tenacity Documentation. Apache 2.0 licensed Python retry library, version 9.1.4 (February 2026). PyPI | GitHub.
- MDN Web Docs. "429 Too Many Requests". Mozilla Developer Network. Reference documentation for HTTP 429 behavior.
- MDN Web Docs. "Retry-After". Mozilla Developer Network. Reference documentation for the Retry-After header.
- urllib3 Contributors. urllib3.util.Retry Documentation. urllib3 2.x. Built-in retry configuration for Python HTTP requests, including backoff_factor, backoff_jitter, status_forcelist, and Retry-After support.
- Google Cloud. "Retry strategy". Google Cloud Storage Documentation. Production guidance on idempotency considerations for retry logic, including conditional idempotency and safe retry classification.
- AWS. "Retry with backoff pattern". AWS Prescriptive Guidance. Cloud design pattern reference covering idempotency requirements, circuit breaker integration, and fail-fast scenarios for retry logic.
- Michael Nygard. Release It! Design and Deploy Production-Ready Software. Pragmatic Bookshelf, 2007 (2nd edition 2018). Introduced the circuit breaker pattern for software systems. Foundational reference for stability patterns including bulkheads, timeouts, and fail-fast behavior in distributed architectures.
- Agoda Engineering. "How Agoda Solved Retry Storms to Boost System Reliability". Medium, August 2024. Practical case study demonstrating the 2^(K-1) retry amplification formula across service chains, with Envoy retry budget configuration as the mitigation strategy.
- Microsoft Azure. "Circuit Breaker pattern". Azure Architecture Center. Cloud design pattern reference covering circuit breaker state machines, failure detection, and integration with retry and timeout patterns.
- Daniel Sim. pybreaker. Python circuit breaker implementation following Michael Nygard's design. Supports failure thresholds, reset timeouts, exclusion lists, and listener callbacks for monitoring.