Python Rate Limiting for OpenAI API Calls: Managing Tokens and Requests Per Minute

OpenAI rate limits are not a single number -- they are a multi-dimensional constraint system with hidden behaviors that catch even experienced developers off guard. You can hit the requests-per-minute ceiling while having plenty of token budget left, a single large prompt can exhaust your tokens-per-minute quota in one call, failed requests still count against your limits, and rate limits can be quantized into intervals shorter than the stated window. Managing all of these dimensions simultaneously is the key challenge when building production applications on the OpenAI API.

But there is a more fundamental way to think about this problem. Rate limiting an LLM API is not the same as rate limiting a traditional REST endpoint. A conventional API call has a roughly predictable cost -- one request, one unit of work. An LLM call is asymmetric: the cost of a single request can vary by orders of magnitude depending on prompt size, output length, and model choice. This asymmetry means that the strategies developers carry over from conventional API work -- simple request counters, fixed retry intervals, uniform backoff -- quietly fail when applied to LLM traffic. The code patterns in this article are designed around that asymmetry from the ground up.

This article covers the full picture of managing OpenAI rate limits in Python: understanding the five dimensions (RPM, TPM, RPD, TPD, and IPM), reading rate limit headers from API responses, pre-counting tokens with tiktoken, building a rate limiter that tracks both requests and tokens, adapting your throttle dynamically based on server feedback, implementing retry logic with tenacity, navigating the silent gotchas that drain your quota without warning, and applying a structured decision framework for batching, caching, and model routing.

Understanding the Five Rate Limit Dimensions

OpenAI enforces rate limits across five independent dimensions. Exceeding any one of them triggers a 429 error, regardless of how much headroom you have on the others:

RPM (Requests Per Minute) -- the number of API calls in a rolling 60-second window. Typically binds on high-frequency small requests: per-user chat, classification pipelines.

TPM (Tokens Per Minute) -- total input + output tokens across all requests in 60 seconds. Typically binds on large prompts: document summarization, code generation with long context.

RPD (Requests Per Day) -- total API calls in a rolling 24-hour window. Typically binds on batch processing jobs that run continuously.

TPD (Tokens Per Day) -- total tokens across all requests in 24 hours. Typically binds on high-volume data processing pipelines.

IPM (Images Per Minute) -- the number of image generation requests in 60 seconds. Typically binds on applications calling DALL-E or GPT Image models for visual content.

The critical detail that catches developers off guard is that TPM counts both input and output tokens combined. A single request with a 10,000-token prompt and a 2,000-token response consumes 12,000 tokens of your TPM budget -- even if you only made one request that minute. You can hit your TPM ceiling with a single large call while your RPM counter shows 1 out of 500.

Think of these five dimensions as independent pressure valves on the same pipeline. Each valve can shut off flow independently. Traditional rate limiting teaches developers to think in one dimension -- requests over time. LLM rate limiting forces you to think in at least two: request frequency and request weight. The same way a shipping company cares about both the number of packages and the total weight on a truck, OpenAI cares about both how often you call and how much computation each call demands. This dual-axis constraint is the core architectural reality that every design decision in this article flows from.
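A quick back-of-the-envelope calculation makes the dual-axis constraint concrete. The limit numbers below are illustrative, not any actual tier's values:

```python
def max_requests_per_minute(
    rpm_limit: int, tpm_limit: int, tokens_per_request: int
) -> int:
    """Effective throughput is the tighter of the two dimensions."""
    return min(rpm_limit, tpm_limit // tokens_per_request)


# Light requests: RPM is the binding constraint
print(max_requests_per_minute(500, 200_000, 300))     # 500 -- RPM-bound
# Heavy requests: TPM binds long before RPM does
print(max_requests_per_minute(500, 200_000, 12_000))  # 16 -- TPM-bound
```

The same 500 RPM allowance yields anywhere from 16 to 500 usable requests per minute depending entirely on request weight.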

Note

Rate limits are enforced at both the organization level and the project level, not per API key or per user. If your team has five developers using separate keys under the same organization, all five share the same RPM, TPM, RPD, and TPD pools. Creating more keys does not increase your limits. You can set project-level limits in the OpenAI developer console to subdivide your organization's quota across different applications or teams -- useful for preventing a background batch job from starving your production API.

Shared Model Limits

Some model families share rate limit pools. Any models listed under a "shared limit" on your organization's limits page draw from the same combined quota. For example, if a group of models shares a 3.5 million TPM allocation, every call to any model in that group counts against that single pool. Before you assume that switching models avoids rate limits, check the limits page to confirm whether the models you are routing between have independent or shared quotas.

Rate Limit Quantization

OpenAI can enforce rate limits in intervals shorter than the stated window. An RPM limit of 60 may be quantized as 1 request per second internally, meaning a burst of 10 requests in one second triggers a 429 even though your average rate over the full minute is well within bounds. This is one of the more confusing sources of 429 errors. If you are seeing rate limit errors despite being under your stated RPM on a per-minute basis, quantization is likely the cause. The fix is to spread requests evenly across the window rather than sending them in bursts.
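A minimal pacing sketch makes this concrete: space requests at the quantized interval of 60 seconds divided by the RPM limit, so no one-second slice ever sees a burst. The send_fn callable here is a placeholder for whatever actually issues the API call:

```python
import time


def paced_interval(rpm_limit: int) -> float:
    """Minimum spacing between requests so an RPM limit of 60
    never exceeds one request in any one-second sub-interval."""
    return 60.0 / rpm_limit


def send_evenly(tasks: list, send_fn, rpm_limit: int) -> list:
    """Send tasks spread evenly across the window instead of bursting."""
    interval = paced_interval(rpm_limit)
    results = []
    for task in tasks:
        start = time.monotonic()
        results.append(send_fn(task))
        # Sleep off whatever time the call itself did not consume
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
    return results
```

At an RPM limit of 60, paced_interval returns 1.0, so requests leave at most once per second -- matching the worst-case quantization rather than the per-minute average.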

Limits also vary by model and tier. Smaller models like GPT-4o mini typically offer roughly 10x the TPM of GPT-4o at the same tier. Long-context models like GPT-4.1 have separate rate limits specifically for long-context requests, which means a single large-context call can trip a dedicated limit that does not affect your standard-context quota. Your tier increases automatically as your API spending history grows, unlocking progressively higher limits. You can view your current tier and exact per-model limits in the OpenAI developer console under Settings, then Limits.

Reading Rate Limit Headers from API Responses

Every OpenAI API response includes headers that tell you exactly where you stand against your limits. These are the key headers to monitor:

from openai import OpenAI

client = OpenAI()


def inspect_rate_limits(model: str = "gpt-4o-mini") -> dict:
    """
    Make a lightweight API call and extract rate limit headers.

    Returns a dict with current RPM and TPM status.
    """
    response = client.chat.completions.with_raw_response.create(
        model=model,
        messages=[{"role": "user", "content": "Hi"}],
        max_tokens=5,
    )

    headers = response.headers
    return {
        "rpm_limit": headers.get("x-ratelimit-limit-requests"),
        "rpm_remaining": headers.get("x-ratelimit-remaining-requests"),
        "rpm_reset": headers.get("x-ratelimit-reset-requests"),
        "tpm_limit": headers.get("x-ratelimit-limit-tokens"),
        "tpm_remaining": headers.get("x-ratelimit-remaining-tokens"),
        "tpm_reset": headers.get("x-ratelimit-reset-tokens"),
    }


limits = inspect_rate_limits()
print(limits)

The with_raw_response method gives you access to the full HTTP response including headers, which the standard .create() method does not expose directly. The x-ratelimit-remaining-* headers are the ones to watch in real time -- they tell you how many requests and tokens you have left before the next reset.

Here is the deeper insight: these headers are not just diagnostic information. They are a feedback signal from the server that your application should be consuming and acting on continuously. The difference between a fragile rate-limited application and a robust one is whether it treats the server's state as the source of truth or its own local counters. Local counters drift. Server headers do not. The adaptive limiting section below builds directly on this principle.

Pro Tip

The x-ratelimit-reset-* headers tell you exactly when your rate limit window resets, formatted as a duration string like 6m30s. When you receive a 429 error, always check the retry-after header first -- it gives you the exact wait time.
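Before you can sleep on a reset header, the duration string needs parsing. Here is a small sketch that handles the h/m/s/ms suffixes -- treating these as the only units you will encounter is an assumption worth verifying against the responses you actually receive:

```python
import re


def parse_reset_duration(value: str) -> float:
    """Convert a duration string like '6m30s', '59s', or '150ms'
    into a number of seconds."""
    pattern = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
    seconds_per_unit = {"h": 3600, "m": 60, "s": 1, "ms": 0.001}
    total = 0.0
    for amount, unit in pattern.findall(value):
        total += float(amount) * seconds_per_unit[unit]
    return total


print(parse_reset_duration("6m30s"))  # 390.0
```

Note the alternation order in the regex: "ms" must come before "m" and "s" so that "150ms" is read as milliseconds rather than minutes followed by seconds.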

Counting Tokens Before You Send with tiktoken

Since TPM is often the binding constraint, knowing how many tokens a request will consume before you send it lets you throttle proactively rather than reactively. OpenAI's tiktoken library provides exact token counts for any model:

import tiktoken


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count the exact number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def estimate_request_tokens(
    messages: list[dict],
    model: str = "gpt-4o",
    max_tokens: int = 1000,
) -> int:
    """
    Estimate total tokens for a chat completion request.

    Counts input tokens from messages and adds the max_tokens
    budget for the expected output.
    """
    encoding = tiktoken.encoding_for_model(model)
    input_tokens = 0

    for message in messages:
        # Approximate per-message overhead for role and formatting
        # markers (the exact count varies slightly by model)
        input_tokens += 4
        input_tokens += len(encoding.encode(message["content"]))

    input_tokens += 2  # reply priming tokens (approximate)
    return input_tokens + max_tokens


# Example: check before sending
messages = [{"role": "user", "content": "Explain rate limiting in 3 sentences."}]
estimated = estimate_request_tokens(messages, max_tokens=200)
print(f"Estimated total tokens: {estimated}")

The estimate_request_tokens function adds the input token count to the max_tokens parameter because OpenAI counts the maximum potential output against your TPM, not the tokens the model produces. Setting max_tokens to 4000 when you only expect a 200-token response wastes 3800 tokens of your TPM budget on every call.

This is worth pausing on because it reveals a non-obvious design tension. Setting max_tokens too low risks truncating useful output. Setting it too high wastes TPM capacity. The optimal value requires you to understand your workload's output distribution -- not just the average, but the tail. If 95% of your responses fit in 300 tokens but 5% need 1200, you need to decide whether to cap at 400 and accept occasional truncation, or set 1200 and accept the TPM overhead on every call. This is a resource allocation tradeoff, not a configuration detail. It connects directly to the decision framework later in this article.
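One way to make this tradeoff deliberately is to derive max_tokens from the observed distribution of completion lengths in your logs. A sketch, assuming you have collected actual completion token counts; the 95th percentile and 10% margin are illustrative choices, not recommendations:

```python
import statistics


def suggest_max_tokens(
    observed_lengths: list[int],
    percentile: int = 95,
    margin: float = 1.1,
) -> int:
    """Pick a max_tokens cap that covers `percentile` percent of
    observed completions, with a safety margin for drift."""
    cut_points = statistics.quantiles(observed_lengths, n=100)
    cap = cut_points[percentile - 1]
    return int(cap * margin)


# Observed completion lengths from production logs (hypothetical)
lengths = [180, 210, 250, 260, 290, 300, 310, 320, 900, 1200]
print(suggest_max_tokens(lengths))
```

Rerunning this periodically as your workload shifts keeps the cap honest: it tracks the tail of the real distribution rather than a guess made at launch.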

Warning

Always set max_tokens as close to your expected response length as possible. OpenAI calculates your TPM usage as the maximum of the actual token count and the max_tokens parameter. An unnecessarily high max_tokens value drains your TPM quota even if the model generates a short response.

Building a Dual-Dimension Rate Limiter

A standard rate limiter tracks only requests. For OpenAI, you need one that tracks both requests and tokens simultaneously:

import asyncio
import time


class OpenAIRateLimiter:
    """
    Dual-dimension rate limiter for OpenAI API calls.

    Tracks both requests per minute and tokens per minute,
    pausing coroutines when either limit is approached.
    """

    def __init__(self, max_rpm: int, max_tpm: int):
        self.max_rpm = max_rpm
        self.max_tpm = max_tpm
        self.remaining_rpm = max_rpm
        self.remaining_tpm = max_tpm
        self.reset_time_rpm = time.monotonic() + 60
        self.reset_time_tpm = time.monotonic() + 60
        self._lock = asyncio.Lock()

    def _maybe_reset(self) -> None:
        """Reset counters if the 60-second window has passed."""
        now = time.monotonic()
        if now >= self.reset_time_rpm:
            self.remaining_rpm = self.max_rpm
            self.reset_time_rpm = now + 60
        if now >= self.reset_time_tpm:
            self.remaining_tpm = self.max_tpm
            self.reset_time_tpm = now + 60

    async def acquire(self, estimated_tokens: int) -> None:
        """
        Wait until both RPM and TPM have sufficient capacity.

        Call this before each API request, passing the estimated
        total tokens (input + max_tokens) for the request.
        """
        while True:
            async with self._lock:
                self._maybe_reset()
                if self.remaining_rpm > 0 and self.remaining_tpm >= estimated_tokens:
                    self.remaining_rpm -= 1
                    self.remaining_tpm -= estimated_tokens
                    return

            # Not enough capacity -- sleep and retry
            sleep_time = min(
                max(0, self.reset_time_rpm - time.monotonic()),
                max(0, self.reset_time_tpm - time.monotonic()),
                1.0,  # check at least every second
            )
            await asyncio.sleep(sleep_time)

    def update_from_headers(self, headers: dict) -> None:
        """
        Sync internal state with server-reported limits.

        Call this after each successful API response to stay
        aligned with OpenAI's actual counters.
        """
        remaining_req = headers.get("x-ratelimit-remaining-requests")
        remaining_tok = headers.get("x-ratelimit-remaining-tokens")
        if remaining_req is not None:
            self.remaining_rpm = int(remaining_req)
        if remaining_tok is not None:
            self.remaining_tpm = int(remaining_tok)

The update_from_headers() method is what makes this limiter production-grade. Rather than relying solely on your local estimate of token usage, it synchronizes with the actual remaining capacity reported by OpenAI's servers after each response. This corrects for any drift between your local accounting and the server's view.

Adaptive Limiting: Letting the Server Guide Your Throttle

The dual-dimension limiter above tracks RPM and TPM locally, but it treats the limits as static numbers you configure once at startup. In production, this assumption breaks. OpenAI can adjust your effective limits dynamically -- quantization can tighten the burst window, shared model pools shift as other applications in your organization consume capacity, and tier upgrades silently raise your ceiling mid-session. A truly resilient rate limiter should adapt.

The idea is straightforward: instead of sleeping for a fixed duration when capacity runs out, let the remaining-capacity headers from each response continuously adjust your sending rate. When the server reports high remaining capacity, send faster. When remaining capacity drops toward zero, slow down before you hit the wall rather than after.

import asyncio
import time


class AdaptiveRateLimiter:
    """
    Rate limiter that adjusts sending pace based on
    server-reported remaining capacity.

    Instead of hard-stopping at a local counter, this limiter
    introduces proportional delays as remaining capacity shrinks,
    creating a smooth deceleration curve rather than a cliff.
    """

    def __init__(self, max_rpm: int, max_tpm: int):
        self.max_rpm = max_rpm
        self.max_tpm = max_tpm
        self.remaining_rpm = max_rpm
        self.remaining_tpm = max_tpm
        self.reset_time = time.monotonic() + 60
        self._lock = asyncio.Lock()

    def _time_until_reset(self) -> float:
        return max(0, self.reset_time - time.monotonic())

    def _calculate_delay(self, estimated_tokens: int) -> float:
        """
        Calculate a proportional delay based on remaining capacity.

        Returns 0 when capacity is plentiful, and progressively
        longer delays as capacity drops below 20% of the max.
        """
        rpm_ratio = self.remaining_rpm / self.max_rpm
        # Account for the weight of the request about to be sent
        tpm_ratio = max(0, self.remaining_tpm - estimated_tokens) / self.max_tpm

        # Use the tighter of the two dimensions
        capacity_ratio = min(rpm_ratio, tpm_ratio)

        if capacity_ratio > 0.2:
            return 0  # plenty of headroom

        # Below 20% capacity: introduce delay proportional to scarcity.
        # With a full 60s window left: ~3s at 10% remaining, ~4.5s at 5%.
        time_left = self._time_until_reset()
        scarcity = 1 - (capacity_ratio / 0.2)
        return min(scarcity * time_left * 0.1, 10.0)

    async def acquire(self, estimated_tokens: int) -> None:
        """Wait based on current capacity, then reserve resources."""
        async with self._lock:
            now = time.monotonic()
            if now >= self.reset_time:
                self.remaining_rpm = self.max_rpm
                self.remaining_tpm = self.max_tpm
                self.reset_time = now + 60

            delay = self._calculate_delay(estimated_tokens)

        if delay > 0:
            await asyncio.sleep(delay)

        async with self._lock:
            self.remaining_rpm -= 1
            self.remaining_tpm -= estimated_tokens

    def update_from_headers(self, headers: dict) -> None:
        """Sync state with server-reported remaining capacity."""
        remaining_req = headers.get("x-ratelimit-remaining-requests")
        remaining_tok = headers.get("x-ratelimit-remaining-tokens")
        if remaining_req is not None:
            self.remaining_rpm = int(remaining_req)
        if remaining_tok is not None:
            self.remaining_tpm = int(remaining_tok)

The key difference from the basic limiter is the _calculate_delay method. Instead of a binary gate -- wait or proceed -- it calculates a proportional delay based on how much capacity remains relative to the maximum. When your remaining budget is above 20%, there is no delay at all. Below 20%, the limiter starts injecting progressively longer pauses, creating a smooth deceleration curve rather than a hard stop followed by a burst when the window resets.

This matters more than it might seem. The hard-stop-then-burst pattern is one of the leading causes of quantization-triggered 429 errors. When your limiter blocks all requests, waits for the reset, then releases a queue of pending requests simultaneously, those requests arrive at the server as a burst -- exactly the pattern that quantized enforcement penalizes. An adaptive limiter that smoothly decelerates avoids this entirely.

Retry Logic with tenacity for 429 Errors

Even with proactive rate limiting, 429 errors can still occur -- especially when rate limits are quantized into shorter intervals. The tenacity library handles retries cleanly:

from openai import RateLimitError
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)


@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
)
async def call_openai(
    client,  # an openai.AsyncOpenAI instance
    messages: list[dict],
    model: str = "gpt-4o-mini",
    max_tokens: int = 1000,
):
    """Make an OpenAI API call with automatic retry on rate limits."""
    return await client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
    )

The retry_if_exception_type(RateLimitError) filter ensures that only 429 errors trigger retries. Authentication errors, invalid requests, and other client errors propagate immediately without wasting retry attempts.

Gotchas That Silently Drain Your Quota

Several behaviors consume rate limit capacity in ways that are not immediately obvious. What connects them is a single principle: the OpenAI API charges your budget when the request is initiated, not when useful work is completed. Understanding this principle -- pay on send, not on success -- reframes every gotcha below from a surprising edge case into a predictable consequence of the system's design.

Failed Requests Still Count

A request that returns a 400 (bad request), a 422 (validation error), or even a 429 (rate limit exceeded) still counts against your rate limit. This means a retry loop hammering the API with malformed requests is burning through your RPM and potentially your TPM budget with every failed attempt. The implication is architectural: validation is not just a correctness concern, it is a rate-limit concern. Every malformed request that reaches the API is capacity you cannot use for productive work. Always validate your request payload before sending it, and make sure your retry logic backs off meaningfully rather than immediately resending.

This also creates a feedback loop worth being aware of. When you hit a 429, the natural instinct is to retry. But the retry itself counts against your limit, which pushes your remaining capacity even lower, which makes the next request more likely to also hit a 429. Without exponential backoff and jitter, a naive retry loop can enter a death spiral where every retry deepens the rate limit pressure rather than relieving it. The tenacity configuration above is specifically designed to break this cycle.

Streaming Uses the Same Pools

Streaming responses via stream=True does not bypass or reduce rate limit consumption. The response is delivered token-by-token in real time, but the full token count -- both prompt tokens and completion tokens -- is consumed against your TPM limit when the request is initiated. There is no discount for streaming. If you are building a streaming chat interface and wondering why your TPM is depleting faster than expected, the answer is that every streamed response consumes the same quota as a non-streamed one.

Rolling Windows, Not Fixed Resets

OpenAI uses rolling 60-second and rolling 24-hour windows. There is no top-of-the-minute refresh or midnight reset. If you make 100 requests between 14:00:30 and 14:01:29, all 100 fall inside a single 60-second span and count together against your RPM, regardless of the minute boundary at 14:01:00. This means you cannot simply wait until the next "round" minute and expect a clean slate. Your local rate limiter should track timestamps, not clock boundaries.

The rolling window also explains why the adaptive limiter is preferable to a fixed-window limiter. A fixed-window implementation that resets on the minute boundary can allow a burst of requests at 14:00:59 followed by another burst at 14:01:01 -- two seconds apart but in "different" local windows. From the server's perspective, both bursts fall within the same rolling 60-second window. Tracking timestamps and smoothing your send rate across the full window eliminates this class of error entirely.

Context Window Limits Are Not Rate Limits

Rate limits control how many requests and tokens you can process over time. Context window limits control the maximum size of a single request. These are separate constraints. A model with a 128,000-token context window and a 200,000 TPM limit can accept a single 128,000-token request, but that one request consumes 64% of the per-minute token budget. Confusing these two concepts leads architects to assume they can send back-to-back large-context requests without hitting TPM ceilings.

The Throughput Decision Framework

Batching, caching, and model selection are not three independent optimizations. They are three levers on the same underlying tradeoff: how to maximize useful work per unit of rate-limited capacity. Treating them as a connected decision framework rather than a checklist of tips changes how you architect LLM-powered systems from the ground up.

Step 1: Identify Your Binding Constraint

Before optimizing anything, determine which dimension is limiting your throughput. If you are RPM-bound (hitting request limits with TPM headroom remaining), the correct lever is batching -- combining multiple tasks into fewer, larger requests. If you are TPM-bound (hitting token limits with RPM headroom remaining), the correct lever is model routing or prompt compression. If you are bound on both simultaneously, you need caching to reduce total demand before the other levers can help.

This diagnosis step is not optional. Batching when you are TPM-bound makes things worse -- you are trading the resource you have (RPM headroom) for the resource you need (TPM headroom), in the wrong direction. The rate limit headers tell you exactly which dimension is tighter. Read them, log them, and let the data drive your optimization choices.
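The diagnosis can be automated from the same headers shown earlier. A sketch that compares the remaining fraction on each dimension and names the tighter one -- returning "unknown" when headers are missing is a design assumption, not part of the API contract:

```python
def binding_constraint(headers: dict) -> str:
    """Return 'rpm', 'tpm', or 'unknown' depending on which
    dimension has the smaller fraction of capacity remaining."""
    try:
        rpm_ratio = (
            int(headers["x-ratelimit-remaining-requests"])
            / int(headers["x-ratelimit-limit-requests"])
        )
        tpm_ratio = (
            int(headers["x-ratelimit-remaining-tokens"])
            / int(headers["x-ratelimit-limit-tokens"])
        )
    except (KeyError, ValueError, ZeroDivisionError):
        return "unknown"
    return "rpm" if rpm_ratio < tpm_ratio else "tpm"


headers = {
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "12",
    "x-ratelimit-limit-tokens": "200000",
    "x-ratelimit-remaining-tokens": "150000",
}
print(binding_constraint(headers))  # rpm -- request capacity is far tighter
```

Logging this value over a day of production traffic tells you which optimization lever to pull before you write any optimization code.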

Step 2: Reduce Total Demand with Caching

Caching eliminates redundant API calls entirely. If your application frequently sends the same or similar prompts, cache the responses locally. A simple dictionary cache keyed by the prompt hash can eliminate a significant percentage of repeat calls. For production applications, use Redis or a database-backed cache with a TTL appropriate for your data freshness requirements.

Caching is the only optimization that reduces pressure on every dimension simultaneously -- fewer requests means lower RPM, fewer tokens means lower TPM, and you save money on every cached hit. It should be your first investment before tuning any other lever. Even a cache hit rate of 15-20% can be the difference between comfortable headroom and chronic 429 errors under load.
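A minimal in-memory version of this cache might look like the following; call_api stands in for whatever function actually hits the API, and a production system would swap the dict for Redis, as noted above:

```python
import hashlib
import time


class PromptCache:
    """In-memory response cache keyed by a hash of model + prompt."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_api) -> str:
        key = self._key(model, prompt)
        entry = self._store.get(key)
        if entry is not None and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # cache hit: no API call, no quota spent
        response = call_api(model, prompt)
        self._store[key] = (time.monotonic(), response)
        return response


cache = PromptCache(ttl_seconds=300)
calls = []


def fake_api(model: str, prompt: str) -> str:
    calls.append(prompt)
    return f"reply:{prompt}"


cache.get_or_call("gpt-4o-mini", "hello", fake_api)
cache.get_or_call("gpt-4o-mini", "hello", fake_api)
print(len(calls))  # 1 -- the second call was served from cache
```

Hashing model and prompt together matters: the same prompt sent to two different models must produce two cache entries, or a cached mini-model answer could masquerade as a larger model's output.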

Step 3: Route by Model Based on Task Complexity

GPT-4o mini typically offers roughly 10x the TPM allowance of GPT-4o at the same tier. If your workload does not strictly require the larger model's capabilities for every request, routing simpler tasks to the mini model can unlock dramatically higher throughput without a tier upgrade. But check the shared model limits on your organization's limits page first -- if both models share a pool, routing between them provides no throughput benefit at all.

The decision should be driven by task requirements, not by default. A common pattern is to classify incoming requests by complexity -- short factual lookups, simple classification tasks, and format conversions go to the smaller model, while nuanced analysis, creative generation, and multi-step reasoning go to the larger one. This classification layer itself can be lightweight: a rule-based router that checks prompt length and task type, or a fast local model that triages the request before it reaches the API.
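The rule-based router can be as small as this sketch. The token threshold and the set of "simple" task labels are illustrative assumptions to tune for your workload, not recommendations:

```python
def route_model(task_type: str, prompt_tokens: int) -> str:
    """Route short, simple tasks to the smaller model and
    everything else to the larger one."""
    simple_tasks = {"classification", "lookup", "format_conversion"}
    if task_type in simple_tasks and prompt_tokens < 2000:
        return "gpt-4o-mini"
    return "gpt-4o"


print(route_model("classification", 150))   # gpt-4o-mini
print(route_model("analysis", 150))         # gpt-4o
print(route_model("classification", 5000))  # gpt-4o
```

The length check is there deliberately: a long "classification" prompt may carry enough context that the larger model's context handling is worth the TPM cost, so both conditions must pass before the cheap route is taken.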

Step 4: Batch When RPM-Bound

If you have 50 short classification tasks, you can bundle them into a single prompt rather than sending 50 individual requests. This trades RPM capacity for TPM capacity -- one request instead of fifty, but with a larger token count. If you are RPM-constrained but have TPM headroom, batching is the immediate fix. If your workload supports it, batching also reduces latency variance because you eliminate 49 round trips worth of network overhead.
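The bundling pattern can be sketched as a prompt builder that numbers the items and requests numbered answers back. The exact output-format instruction is an assumption -- structured outputs or JSON mode make the parsing more reliable in practice:

```python
def build_batch_prompt(items: list[str], instruction: str) -> str:
    """Bundle many small tasks into one numbered prompt."""
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    return (
        f"{instruction}\n"
        "Answer with one line per item, formatted as '<number>: <answer>'.\n\n"
        f"{numbered}"
    )


prompt = build_batch_prompt(
    ["I love this product", "Terrible experience"],
    "Classify the sentiment of each review as positive or negative.",
)
print(prompt)
```

One prompt like this replaces fifty individual requests: fifty units of RPM pressure collapse into one, at the cost of a proportionally larger single-request token count.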

Pro Tip

For large batch jobs that do not need real-time responses, use OpenAI's Batch API. It runs asynchronously and has separate, higher rate limits than the synchronous endpoints. This is ideal for processing thousands of prompts overnight. The Batch API operates outside your synchronous rate limit pools entirely, so it does not compete with your real-time traffic for capacity.
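The submission flow, sketched from the documented shape of the Batch API: write one request per line to a .jsonl file, upload it with purpose "batch", then create the batch job. The field names here follow the published format, but verify them against the current Batch API reference before relying on them:

```python
import json
import os

# One chat completion request per line, each with a unique custom_id
tasks = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 100,
        },
    }
    for i, prompt in enumerate(["Summarize document A.", "Summarize document B."])
]

with open("batch_input.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")

# Uploading and creating the batch requires a configured API key
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    batch_file = client.files.create(
        file=open("batch_input.jsonl", "rb"), purpose="batch"
    )
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(batch.id)
```

The custom_id field is what lets you match results back to inputs when the output file arrives, since the Batch API does not guarantee result ordering.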

Key Takeaways

  1. LLM rate limiting is fundamentally different from traditional API rate limiting: The asymmetry between request count and request weight means that strategies inherited from conventional REST APIs -- simple counters, fixed backoff, uniform throttling -- silently fail when applied to LLM traffic. Design your rate limiting around the dual-axis reality of RPM and TPM from the start.
  2. OpenAI rate limits are multi-dimensional: RPM, TPM, RPD, TPD, and IPM are enforced independently. Exceeding any single dimension triggers a 429, even if the others have capacity. Your rate limiting code must track both requests and tokens at minimum.
  3. Limits are scoped to organizations and projects, not keys: Creating more API keys does not increase your limits. Use project-level limits in the developer console to prevent one workload from starving another.
  4. TPM is often the binding constraint: A single large prompt can exhaust your TPM budget in one call. Set max_tokens as close to your expected response length as possible -- the maximum potential output counts against your TPM, not the tokens produced. Understand your output distribution to make this tradeoff deliberately.
  5. Treat server headers as your source of truth: The x-ratelimit-remaining-* headers tell you exactly how much capacity you have left. Use these to synchronize your local rate limiter with the server's state, and build adaptive throttling that responds to real capacity rather than static estimates.
  6. Pre-count tokens with tiktoken: Estimating token usage before sending a request lets you throttle proactively rather than waiting for 429 errors to tell you that you have exceeded the limit.
  7. Smooth your send rate to defeat quantization: Adaptive rate limiting that decelerates proportionally to remaining capacity avoids the burst-then-block pattern that triggers quantized rate limit enforcement. Spread requests evenly across the window rather than sending them in clusters.
  8. Use tenacity for retry logic: Filter retries to RateLimitError only, use random exponential backoff to avoid thundering herd problems, and cap the maximum number of retry attempts. Remember that retries themselves consume rate limit capacity.
  9. The pay-on-send principle explains every gotcha: Failed requests still count against your limits. Streaming uses the same pools as standard calls. Rolling windows mean there is no clean reset point. These are all consequences of the same design: OpenAI charges your budget at request initiation, not at successful completion.
  10. Optimize as a connected decision framework: Diagnose your binding constraint first. Cache to reduce total demand. Route by model to increase TPM ceiling. Batch to reduce RPM pressure. The order matters -- applying the wrong lever to the wrong constraint wastes effort or makes things worse.

The OpenAI API's rate limiting model is more nuanced than a simple requests-per-second cap. It requires thinking about request weight alongside request frequency, treating server feedback as a continuous signal rather than an error condition, and understanding how every design decision -- from max_tokens tuning to model routing -- connects back to the same finite pool of capacity. The patterns in this article give you the tools to manage all of this in Python. More importantly, they give you a mental model for reasoning about LLM API constraints that will remain useful as rate limit numbers change, new models launch, and the specific parameters shift beneath the principles that govern them.