FastAPI gives you two clean mechanisms for intercepting requests: middleware and dependency injection. Both can enforce rate limits, but they serve different purposes. Middleware catches every request globally before it reaches any route. Dependencies are applied per-endpoint, giving you granular control. A well-designed rate limiting system uses both -- and it does so with a clear understanding of what it can and cannot protect against.
This article builds a rate limiter for FastAPI from the ground up, but it starts from a different place than the typical tutorial. Before writing a single line of middleware, you need to understand what threat you are mitigating, what an attacker would do to circumvent it, and where application-level rate limiting fits in the broader security stack. From there, you will build a reusable sliding window counter class, wrap it in Starlette's BaseHTTPMiddleware for global enforcement, create a parameterized Depends() factory for per-route limits, and learn when to reach for an existing library instead.
Rate Limiting as a Security Control
Rate limiting is not just a performance optimization. It is a security control that appears directly in the OWASP API Security Top 10 under API4:2023 Unrestricted Resource Consumption. The vulnerability is straightforward: if an API does not restrict how many requests a client can make, or how much data a single request can consume, an attacker can exhaust server resources, inflate cloud costs, or degrade service for every other user. As the OWASP specification puts it, the API is vulnerable if limits on the number of requests, payload sizes, or resource consumption are "missing or set inappropriately."
The threat landscape for APIs has shifted significantly. Automated tooling makes it trivial to launch credential stuffing attacks against login endpoints, enumerate user accounts through timing differences in responses, or scrape proprietary data faster than a human could browse it. Rate limiting addresses all of these by introducing a cost -- time -- that makes brute-force approaches impractical without eliminating access for legitimate users.
But the effectiveness of rate limiting depends on where you enforce it and how you identify clients. A rate limiter that trusts the X-Forwarded-For header without validation is trivially bypassed. A rate limiter that only counts requests by IP address cannot distinguish between a thousand users behind a corporate NAT and a single attacker with one IP. Every implementation decision in this article is shaped by these realities.
What an Attacker Sees
Before building a rate limiter, it is worth thinking about what your system looks like from the other side. An attacker probing your API for rate limiting behavior will typically try several approaches, and understanding them helps you design a limiter that holds up under adversarial conditions rather than just normal traffic.
Header spoofing is the lowest-effort bypass. If your rate limiter identifies clients by X-Forwarded-For, an attacker can rotate through fake IP addresses on every request. The rate limiter sees each request as coming from a new client and never trips the threshold. This is why the client identification function in this article checks for an API key first, falls back to the forwarded header only in trusted proxy environments, and defaults to the direct socket IP.
Distributed requests are harder to counter at the application level. A botnet or a pool of residential proxies can distribute requests across hundreds or thousands of source IPs. Each IP stays well under the per-client limit, but the aggregate volume overwhelms your backend. Application-level rate limiting cannot solve this -- it requires edge-level mitigation from a CDN or DDoS protection service. Recognizing this boundary is as important as building the limiter itself.
Slowloris-style abuse targets connection resources rather than request volume. An attacker opens many connections and sends data as slowly as possible, tying up server threads or event loop slots without triggering request-based rate limits. This is a reminder that rate limiting is not a complete defense. It works alongside connection timeouts, request body size limits, and proper ASGI server configuration. Uvicorn's --limit-concurrency flag caps the number of concurrent connections allowed before returning 503, and --timeout-keep-alive (default: 5 seconds) controls how long idle connections persist. Both settings are documented in the Uvicorn settings reference.
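As a concrete illustration, the two flags can be combined on the command line. This is a sketch, not a tuned configuration -- the module path `main:app` and the concurrency value of 512 are placeholders you would adjust for your own deployment:

```shell
# Cap concurrent connections at 512 (requests beyond that get 503) and
# close idle keep-alive connections after 5 seconds (the Uvicorn default).
uvicorn main:app --limit-concurrency 512 --timeout-keep-alive 5
```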
Endpoint-targeted abuse exploits the fact that different endpoints have different costs. A search endpoint that queries a database is far more expensive than a health check that returns a static response. An attacker who stays under the global rate limit but hammers only the most expensive endpoint can still degrade your service. This is precisely why per-route rate limiting through Depends() exists -- it lets you assign limits proportional to the cost of each operation.
Rate limiting can also leak information. If your rate limiter returns different responses for valid vs. invalid usernames on a login endpoint (for example, rate limiting only after successful lookups), an attacker can use the rate limit behavior itself to enumerate valid accounts. Apply rate limits consistently regardless of the outcome of the underlying operation.
Middleware vs. Dependencies: When to Use Each
FastAPI is built on Starlette, which means it inherits Starlette's middleware system. Middleware wraps the entire request/response cycle -- it runs before the request reaches your route function and after the response leaves it. This makes middleware ideal for global policies: if every endpoint should be rate limited, middleware ensures nothing slips through.
Dependencies, on the other hand, are declared per-route using Depends(). They run after middleware but before your endpoint function. This lets you set different limits for different routes. A login endpoint might allow 5 requests per minute, while a read-only data endpoint allows 100. You cannot achieve this kind of granularity with middleware alone.
The practical pattern is to layer both: middleware enforces a generous global ceiling (say, 200 requests per minute per IP), and dependencies enforce tighter per-route limits where needed. Think of it as defense in depth applied to traffic management. The global layer is your safety net -- it catches runaway clients and basic abuse without needing to know anything about your individual endpoints. The per-route layer is your precision tool -- it protects expensive operations and sensitive endpoints according to their specific risk profile.
Starlette's BaseHTTPMiddleware creates a separate task and several intermediate objects for each request, which adds measurable overhead. It also has a known limitation: changes to contextvars.ContextVar values made downstream will not propagate back upstream through the middleware (see the Starlette documentation). For extremely high-throughput services (tens of thousands of requests per second), consider using a pure ASGI middleware instead -- benchmarks show significant throughput gains from dropping down to the ASGI level. For the vast majority of APIs, BaseHTTPMiddleware is perfectly fine and far simpler to implement.
Choosing the Right Algorithm
Before building the rate limiter, it helps to understand why this article uses a sliding window counter instead of the alternatives. Each algorithm trades off between memory efficiency, accuracy, and burst tolerance.
| Algorithm | Memory per Client | Burst Handling | Accuracy | Complexity |
|---|---|---|---|---|
| Fixed Window | 1 counter + timestamp | Allows 2x burst at window edges | Low (boundary problem) | Trivial |
| Sliding Window Log | 1 timestamp per request | Precise -- no burst gap | Exact | High memory at scale |
| Sliding Window Counter | 2 counters + window ID | Smooth -- weighted estimate | Near-exact (within ~1-2%) | Low |
| Token Bucket | Token count + timestamp | Allows controlled bursts | Exact for its model | Moderate |
| Leaky Bucket | Queue length + timestamp | Smooths all bursts | Exact for its model | Moderate |
The fixed window algorithm is the simplest: count requests in a time window, reject when the count exceeds the limit. The problem is the boundary condition. If a client sends 100 requests at 11:59:59 and another 100 at 12:00:01, they get 200 requests in two seconds while the "limit" is 100 per minute. This is the well-known double-burst problem, and it is often the first thing an attacker tests.
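The double-burst problem is easiest to see in code. Here is a deliberately naive fixed-window counter; time is passed in explicitly so the boundary behavior is reproducible rather than wall-clock dependent:

```python
class FixedWindowLimiter:
    """Naive fixed-window counter, shown only to illustrate the boundary problem."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts: dict[int, int] = {}

    def is_allowed(self, now: float) -> bool:
        # Each window is identified by an integer bucket of the timestamp.
        window_id = int(now // self.window_seconds)
        self.counts[window_id] = self.counts.get(window_id, 0) + 1
        return self.counts[window_id] <= self.limit


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at t=59.9 (end of window 0) all succeed...
burst_one = sum(limiter.is_allowed(59.9) for _ in range(100))
# ...and 100 more at t=60.1 (start of window 1) also succeed.
burst_two = sum(limiter.is_allowed(60.1) for _ in range(100))
print(burst_one + burst_two)  # 200 -- double the "limit" in 0.2 seconds
```

The counter never sees anything wrong: each window stays at exactly 100. The violation only exists across the window boundary, which is precisely where this algorithm cannot look.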
The sliding window log eliminates this by storing the timestamp of every request and counting only those within the trailing window. It is perfectly accurate but memory-hungry -- a client making 1000 requests per minute consumes 1000 timestamps.
The sliding window counter is the pragmatic middle ground. It tracks only two integers per client (current window count and previous window count) and interpolates between them based on how far into the current window you are. The estimate is within 1-2% of the exact count, and memory stays constant regardless of request volume. Cloudflare uses this exact approach in their production rate limiter, and in their engineering blog they report that across 400 million requests from 270,000 distinct sources, the algorithm misclassified only 0.003% of requests, with an average difference of 6% between the real rate and the approximate rate. That is the algorithm this article implements.
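The interpolation is easiest to see with concrete numbers. The counts below are illustrative, not taken from the Cloudflare post:

```python
window_seconds = 60
limit = 100

# 15 seconds into the current window, the previous window
# still carries 75% of its weight.
elapsed = 15
prev_weight = 1 - (elapsed / window_seconds)  # 0.75

previous_count = 84  # requests in the last full window
current_count = 36   # requests so far in this window

estimated = previous_count * prev_weight + current_count
print(estimated)  # 99.0 -- one request under the limit of 100
```

As the current window progresses, prev_weight shrinks toward zero and the estimate converges on the current window's true count, so there is never a cliff where the previous window's traffic suddenly stops mattering.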
The token bucket and leaky bucket take a different approach entirely. Instead of counting requests in a window, the token bucket fills at a steady rate and each request consumes a token. This naturally allows short bursts (the bucket can accumulate unused tokens) while enforcing a long-term average rate. The leaky bucket is the inverse: it smooths requests into a steady outflow, never allowing bursts. Both are common in network-level rate limiting and in libraries like pyrate-limiter (which powers fastapi-limiter).
Cloudflare's engineering team validated the sliding window counter extensively in production and found it performed well enough to handle their scale -- it smoothed the boundary spike problem inherent to fixed windows while remaining straightforward to configure and reason about. Their findings are detailed in the engineering blog post on building rate limiting at scale.
Beyond Basic Counting: Advanced Rate Limiting Strategies
The algorithms above cover the standard approaches, but production APIs often face problems that no single algorithm solves. Several advanced strategies deserve consideration when you move past tutorial-level implementations.
Adaptive rate limiting adjusts thresholds dynamically based on server load rather than relying on fixed values. Instead of always allowing 200 requests per minute, the limiter monitors CPU usage, memory pressure, or response latency and tightens limits when the system is under stress. The implementation is straightforward: read a health metric on each request check and multiply the configured limit by a load factor between 0.0 and 1.0. When your database is struggling, the limiter automatically becomes more aggressive without requiring manual intervention or a deployment. This approach is especially valuable for services with unpredictable traffic patterns where static limits are either too generous during load spikes or too restrictive during normal operation.
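A minimal sketch of the load-factor idea. How you obtain the 0.0-1.0 health score (CPU sampling, latency percentiles, a probe against the database) is deployment-specific and not shown here:

```python
def effective_limit(base_limit: int, load_factor: float) -> int:
    """Scale the configured limit by a 0.0-1.0 health score.

    1.0 means fully healthy (full budget); 0.0 means saturated.
    The result is floored at 1 so the API never locks out everyone.
    """
    clamped = min(1.0, max(0.0, load_factor))
    return max(1, int(base_limit * clamped))


print(effective_limit(200, 1.0))  # 200 -- healthy system, full budget
print(effective_limit(200, 0.5))  # 100 -- degraded system, tightened budget
print(effective_limit(200, 0.0))  # 1 -- saturated, but not a total lockout
```

The limiter then compares its estimated count against `effective_limit(...)` instead of the static configured value, so the threshold moves with backend health on every check.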
Cost-weighted rate limiting assigns different weights to different operations instead of treating every request equally. A search query that triggers a full-text database scan might cost 10 units, while a cached health check costs 1. The client's budget depletes faster when they hit expensive endpoints, even if their raw request count is low. This prevents a client from staying under a simple request count limit while still overwhelming your backend by targeting your heaviest operations. You can implement this by extending the sliding window counter to accept a cost parameter in the is_allowed method and incrementing entry["current"] by the cost rather than by 1.
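The budget-depletion mechanic can be sketched in isolation. The cost table below is hypothetical -- the paths and unit values are illustrative, not a recommendation:

```python
# Hypothetical cost table: expensive operations consume more budget units.
COSTS = {"/health": 1, "/api/data": 2, "/api/search": 10}


def charge(budget: int, path: str) -> tuple[bool, int]:
    """Deduct the path's cost from the remaining budget; reject if it cannot pay."""
    cost = COSTS.get(path, 1)  # unknown paths default to the cheapest cost
    if cost > budget:
        return False, budget
    return True, budget - cost


budget = 25
for path in ["/api/search", "/api/search", "/api/data"]:
    allowed, budget = charge(budget, path)
print(budget)  # 3 -- two searches and one read nearly drained a 25-unit budget
```

The same client making 25 health checks would have spent the identical budget, which is the point: the limit tracks backend load, not raw request count.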
Hierarchical quotas with priority queues go beyond simple per-user limits by organizing clients into tiers with shared and individual budgets. A free-tier user might have a personal limit of 100 requests per hour within an organizational pool of 1,000 per hour shared across all free-tier users. When the organizational pool is exhausted, individual free-tier clients are throttled even if they have not hit their personal limit -- but premium-tier clients continue unaffected. This requires maintaining counters at multiple granularity levels (individual, tier, organization) and checking all applicable levels on each request. The complexity is significant, but for SaaS products with tiered pricing, it accurately reflects the business model in the rate limiting layer rather than bolting pricing logic on top of a flat limiter.
Request fingerprinting beyond IP and API key addresses a fundamental weakness of traditional client identification. Advanced fingerprinting combines multiple request signals -- TLS client hello characteristics (JA3/JA4 fingerprints), HTTP/2 settings frames, header ordering patterns, and Accept-Language consistency -- to build a composite client identity. Two requests from the same IP but with different TLS fingerprints are likely different clients; conversely, requests from rotating IPs but with identical JA3 hashes are likely the same client using a proxy pool. This technique is used by sophisticated edge providers and can be partially implemented at the application level by hashing a combination of request characteristics into the rate limit key.
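A partial application-level version can be sketched by hashing whatever signals reach the app. The specific signals below, and the idea of receiving a JA3 hash from an upstream proxy header, are assumptions about your deployment:

```python
import hashlib


def fingerprint_key(ip: str, user_agent: str, accept_language: str, ja3: str = "") -> str:
    """Combine several request signals into one composite rate limit key."""
    raw = "|".join([ip, user_agent, accept_language, ja3])
    return "fp:" + hashlib.sha256(raw.encode()).hexdigest()[:16]


# Same IP, different TLS fingerprint -> treated as different clients.
a = fingerprint_key("203.0.113.7", "curl/8.5", "en-US", ja3="abc123")
b = fingerprint_key("203.0.113.7", "curl/8.5", "en-US", ja3="def456")
print(a != b)  # True
```

The resulting string plugs directly into the rate limiter as a key, the same way `get_client_key` values do. Note the inverse case from the text -- rotating IPs with an identical JA3 -- requires keying on the fingerprint components without the IP, so which signals you include is itself a policy decision.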
Circuit breaker integration treats rate limiting as part of a broader resilience pattern rather than an isolated feature. When a downstream dependency (database, external API, cache layer) starts failing, a circuit breaker opens and the rate limiter simultaneously tightens to reduce the pressure that caused the failure in the first place. Without this coordination, a degraded dependency can still receive traffic at full rate because the rate limiter has no awareness of backend health. Implementing this requires a shared state object that both the circuit breaker and rate limiter can read -- when the circuit breaker transitions from closed to open, it sets a flag that causes the rate limiter to drop its effective limit by a configurable percentage until the circuit closes again.
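The shared-state coordination can be sketched like this. The class name and the 25% degraded factor are illustrative choices, not a standard:

```python
class BackendHealth:
    """Shared state: written by a circuit breaker, read by the rate limiter."""

    def __init__(self, degraded_factor: float = 0.25):
        self.circuit_open = False
        self.degraded_factor = degraded_factor

    def effective_limit(self, configured_limit: int) -> int:
        """Drop the limit sharply while the circuit is open."""
        if self.circuit_open:
            return max(1, int(configured_limit * self.degraded_factor))
        return configured_limit


health = BackendHealth()
print(health.effective_limit(200))  # 200 -- circuit closed, normal limit
health.circuit_open = True          # circuit breaker trips on backend failures
print(health.effective_limit(200))  # 50 -- limiter tightens until the circuit closes
```

The rate limiter calls `health.effective_limit(...)` on every check, so no explicit notification is needed when the circuit transitions -- the next request simply sees the tightened threshold.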
Graduated response escalation replaces the binary allowed/rejected model with a spectrum of responses. At 80% of the limit, responses begin including Warning headers. At 90%, non-essential response fields are stripped to reduce server-side processing cost. At 100%, the standard 429 response fires. This gives well-behaved clients the opportunity to back off gracefully before hitting the wall, while poorly-behaved clients still get hard-capped. The implementation adds conditional logic after the is_allowed check that inspects the remaining value and modifies the response accordingly, which requires the middleware to wrap the downstream response rather than just gating it.
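The escalation thresholds reduce to a small classifier over the fraction of the limit consumed. The 80/90/100 breakpoints follow the text; the tier names are illustrative:

```python
def response_tier(used: int, limit: int) -> str:
    """Map current usage against the limit to an escalation tier."""
    ratio = used / limit
    if ratio >= 1.0:
        return "reject"   # standard 429
    if ratio >= 0.9:
        return "degrade"  # strip non-essential response fields
    if ratio >= 0.8:
        return "warn"     # add a Warning header
    return "ok"


print([response_tier(n, 100) for n in (50, 85, 95, 100)])
# ['ok', 'warn', 'degrade', 'reject']
```

The middleware would branch on the returned tier after calling `is_allowed`, which is why graduated responses require wrapping the downstream response rather than just gating the request.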
Building the Rate Limiter Core
With the algorithm chosen, here is a sliding window counter that works in-memory -- suitable for single-process deployments:
```python
import time
import threading
from dataclasses import dataclass, field


@dataclass
class SlidingWindowLimiter:
    """
    In-memory sliding window counter rate limiter.

    Uses weighted counts from current and previous windows
    to approximate a sliding window with constant memory.
    """

    limit: int
    window_seconds: int
    _counters: dict[str, dict] = field(default_factory=dict, init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def is_allowed(self, key: str) -> tuple[bool, dict]:
        """
        Check if a request is allowed for the given key.

        Returns (allowed, info) where info contains limit,
        remaining, and reset values.
        """
        with self._lock:
            now = time.time()
            window_id = int(now // self.window_seconds)
            elapsed = now - (window_id * self.window_seconds)
            prev_weight = 1 - (elapsed / self.window_seconds)

            if key not in self._counters:
                self._counters[key] = {
                    "current": 0,
                    "previous": 0,
                    "window_id": window_id,
                }
            entry = self._counters[key]

            # Roll the window forward if needed
            if entry["window_id"] < window_id:
                if entry["window_id"] == window_id - 1:
                    entry["previous"] = entry["current"]
                else:
                    entry["previous"] = 0
                entry["current"] = 0
                entry["window_id"] = window_id

            estimated = (entry["previous"] * prev_weight) + entry["current"]

            if estimated >= self.limit:
                reset = self.window_seconds - elapsed
                return False, {
                    "limit": self.limit,
                    "remaining": 0,
                    "reset": round(reset, 1),
                }

            entry["current"] += 1
            remaining = max(0, int(self.limit - estimated - 1))
            return True, {
                "limit": self.limit,
                "remaining": remaining,
                "reset": round(self.window_seconds - elapsed, 1),
            }
```
The sliding window counter uses only two integers per client (current and previous window counts) instead of storing individual timestamps. This keeps memory consumption constant regardless of how many requests each client sends. The threading.Lock ensures safe access when FastAPI runs with multiple threads behind Uvicorn.
There is a subtle design choice worth noting. The estimated calculation uses prev_weight, which decreases linearly from 1.0 to 0.0 as the current window progresses. At the start of a new window, the previous window's count has full weight. Halfway through, it has half weight. This weighted blending is what eliminates the fixed-window boundary problem -- there is no sharp reset point where an attacker can double their throughput.
The in-memory _counters dictionary will grow without bound as new client keys appear. In production, you should add a periodic cleanup that evicts entries whose window_id is more than two windows behind the current one. A background thread or an asyncio task running every few minutes is sufficient. For multi-process deployments (multiple Uvicorn workers), this in-memory approach will not work correctly because each process maintains a separate counter -- use Redis or another shared store instead.
Extracting the Client Identity
Rate limiting is only effective if you correctly identify who is making each request. This is more nuanced than it appears. A shared utility function handles the common cases, but the ordering of checks matters:
```python
from fastapi import Request


def get_client_key(request: Request) -> str:
    """
    Extract a rate limit key from the request.

    Priority: API key > X-Forwarded-For > direct IP.
    API keys give per-user accuracy.
    X-Forwarded-For requires a trusted proxy.
    Direct IP is the safest fallback.
    """
    # Check for API key first (most specific)
    api_key = request.headers.get("X-API-Key")
    if api_key:
        return f"apikey:{api_key}"

    # Check forwarded header for proxy setups
    forwarded = request.headers.get("X-Forwarded-For")
    if forwarded:
        return f"ip:{forwarded.split(',')[0].strip()}"

    # Fall back to direct client IP
    if request.client:
        return f"ip:{request.client.host}"
    return "ip:unknown"
```
API key identification is the most reliable because it ties rate limits to a specific user or service account regardless of what IP they connect from. This also lets you implement tiered limits -- premium users get a higher quota, trial users get a lower one -- by looking up the key's plan before choosing a limiter configuration.
IP-based identification is the most common but the least reliable. All users behind a corporate NAT or VPN exit node share the same IP. On the other hand, a single attacker using a rotating proxy service can appear as thousands of different IPs. For unauthenticated endpoints, IP-based limiting is the best you can do, but you should set limits generously enough to avoid penalizing shared-IP environments while still catching obvious abuse.
Never trust X-Forwarded-For blindly. Clients can spoof this header to bypass IP-based rate limits. If your API sits behind a trusted reverse proxy (Nginx, AWS ALB, Cloudflare), configure that proxy to set the header and strip any client-provided values. If your API is directly exposed, ignore X-Forwarded-For entirely and use request.client.host.
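A stricter variant of the identification logic makes the proxy trust explicit instead of implicit. It is shown here over plain dict/str inputs so the behavior is easy to verify; the trust_proxy flag is an assumed deployment setting you would load from configuration:

```python
def get_client_key_strict(headers: dict[str, str], socket_ip: str,
                          trust_proxy: bool) -> str:
    """Honor X-Forwarded-For only when a trusted proxy is known to set it."""
    api_key = headers.get("X-API-Key")
    if api_key:
        return f"apikey:{api_key}"
    if trust_proxy:
        forwarded = headers.get("X-Forwarded-For")
        if forwarded:
            # Trusted proxy: first entry is the original client.
            return f"ip:{forwarded.split(',')[0].strip()}"
    # Directly exposed: ignore spoofable headers entirely.
    return f"ip:{socket_ip}"


spoofed = {"X-Forwarded-For": "1.2.3.4"}
print(get_client_key_strict(spoofed, "203.0.113.9", trust_proxy=False))  # ip:203.0.113.9
print(get_client_key_strict(spoofed, "10.0.0.2", trust_proxy=True))      # ip:1.2.3.4
```

With trust_proxy=False, the spoofed header is simply never consulted, which closes the header-spoofing bypass described earlier.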
Global Rate Limiting with BaseHTTPMiddleware
Starlette's BaseHTTPMiddleware provides a dispatch method that intercepts every request. You override this method to check the rate limit before passing the request through to the actual endpoint:
```python
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse
from fastapi import FastAPI, Request
from collections.abc import Callable


class RateLimitMiddleware(BaseHTTPMiddleware):
    """
    Global rate limiting middleware for FastAPI.

    Intercepts every request, checks the rate limit, and
    returns 429 with proper headers if the limit is exceeded.
    """

    def __init__(
        self,
        app,
        limiter: SlidingWindowLimiter,
        key_func: Callable[[Request], str] | None = None,
        exclude_paths: list[str] | None = None,
    ):
        super().__init__(app)
        self.limiter = limiter
        self.key_func = key_func or get_client_key
        self.exclude_paths = exclude_paths or []

    async def dispatch(self, request: Request, call_next):
        # Skip rate limiting for excluded paths
        if request.url.path in self.exclude_paths:
            return await call_next(request)

        key = self.key_func(request)
        allowed, info = self.limiter.is_allowed(key)

        if not allowed:
            return JSONResponse(
                status_code=429,
                content={
                    "detail": "Rate limit exceeded",
                    "retry_after": info["reset"],
                },
                headers={
                    "Retry-After": str(int(info["reset"]) + 1),
                    "X-RateLimit-Limit": str(info["limit"]),
                    "X-RateLimit-Remaining": "0",
                    "X-RateLimit-Reset": str(int(info["reset"])),
                },
            )

        # Request is allowed -- pass through and add headers
        response = await call_next(request)
        response.headers["X-RateLimit-Limit"] = str(info["limit"])
        response.headers["X-RateLimit-Remaining"] = str(info["remaining"])
        response.headers["X-RateLimit-Reset"] = str(int(info["reset"]))
        return response
```
Wiring it into the app takes two lines:
```python
app = FastAPI()

# Global: 200 requests per minute per client
global_limiter = SlidingWindowLimiter(limit=200, window_seconds=60)

app.add_middleware(
    RateLimitMiddleware,
    limiter=global_limiter,
    exclude_paths=["/health", "/metrics", "/docs", "/openapi.json"],
)
```
The exclude_paths parameter is important in production. Health check endpoints are called frequently by load balancers and monitoring systems. If you rate limit them, your orchestrator might think your service is down when it is not. The /docs and /openapi.json paths serve FastAPI's built-in Swagger UI -- there is no reason to rate limit documentation.
Notice that the middleware returns the 429 response before calling call_next. This is critical from a resource perspective. A rejected request should consume as few server resources as possible -- no database queries, no business logic, no response serialization. The middleware acts as a gatekeeper that turns away excess traffic at the door rather than letting it through and trying to clean up afterward.
Per-Route Rate Limiting with Depends
For endpoints that need tighter or looser limits than the global middleware provides, FastAPI's dependency injection system is the right tool. You create a factory function that returns a dependency configured with specific limits:
```python
from fastapi import HTTPException, Depends

# Store limiters by their configuration to avoid duplicates
_route_limiters: dict[str, SlidingWindowLimiter] = {}


def get_route_limiter(limit: int, window: int) -> SlidingWindowLimiter:
    """Get or create a limiter for the given configuration."""
    cache_key = f"{limit}:{window}"
    if cache_key not in _route_limiters:
        _route_limiters[cache_key] = SlidingWindowLimiter(
            limit=limit, window_seconds=window
        )
    return _route_limiters[cache_key]


def rate_limit(limit: int, window: int = 60):
    """
    Factory that returns a FastAPI dependency for per-route limiting.

    Usage:
        @app.get("/login", dependencies=[Depends(rate_limit(5, 60))])
        async def login(): ...
    """
    limiter = get_route_limiter(limit, window)

    async def check(request: Request):
        key = get_client_key(request) + f":{request.url.path}"
        allowed, info = limiter.is_allowed(key)
        if not allowed:
            raise HTTPException(
                status_code=429,
                detail={
                    "error": "Rate limit exceeded for this endpoint",
                    "retry_after": info["reset"],
                },
                headers={
                    "Retry-After": str(int(info["reset"]) + 1),
                },
            )

    return check
```
The dependency appends the request path to the client key. This is critical because the per-route limiter tracks a separate counter per endpoint. Without the path in the key, a client's requests to /login would count against their limit on /api/data, which is not the intended behavior.
Applying it to routes is clean and explicit:
```python
@app.post("/auth/login", dependencies=[Depends(rate_limit(5, 60))])
async def login(credentials: dict):
    """Login endpoint: 5 requests per minute per client."""
    return {"token": "..."}


@app.get("/api/data", dependencies=[Depends(rate_limit(100, 60))])
async def get_data():
    """Data endpoint: 100 requests per minute per client."""
    return {"data": []}


@app.post("/api/upload", dependencies=[Depends(rate_limit(10, 60))])
async def upload_file():
    """Upload endpoint: 10 requests per minute per client."""
    return {"status": "uploaded"}
```
The limit values here are not arbitrary. The login endpoint gets the tightest limit (5 per minute) because it is a high-value target for credential stuffing and brute-force attacks. Five attempts per minute is enough for a legitimate user who mistyped their password, but it makes an automated attack take weeks to work through even a small dictionary. The upload endpoint gets 10 per minute because file uploads are resource-intensive -- each one involves disk I/O, virus scanning, or object storage writes. The data endpoint gets the most generous limit because read operations are cheap and legitimate clients may poll frequently.
You can also apply per-route limits at the router level using APIRouter(dependencies=[Depends(rate_limit(50, 60))]). This applies the same limit to every endpoint in that router, which is useful for grouping endpoints by access tier -- for example, one router for free-tier routes with conservative limits and another for premium routes with higher limits.
Adding Rate Limit Headers to Every Response
Well-behaved APIs communicate rate limit status on every response, not just 429s. This is not just a courtesy -- it is a fundamental part of the security model. Clients that can see how close they are to the limit can self-regulate before they hit the wall. Clients that get no warning until they are blocked will hammer your API with retries, amplifying the very problem rate limiting is supposed to prevent.
The middleware above already adds headers to successful responses, but the per-route dependency raises an HTTPException on rejection, which does not automatically include the custom headers. You can fix this with a custom exception handler:
```python
from fastapi.responses import JSONResponse
from fastapi.exceptions import HTTPException as FastAPIHTTPException


@app.exception_handler(FastAPIHTTPException)
async def rate_limit_exception_handler(request: Request, exc: FastAPIHTTPException):
    """Ensure rate limit headers appear on 429 responses."""
    headers = getattr(exc, "headers", None) or {}
    if exc.status_code == 429:
        return JSONResponse(
            status_code=429,
            content=exc.detail if isinstance(exc.detail, dict) else {"detail": exc.detail},
            headers=headers,
        )
    # Pass all other HTTP exceptions through unchanged
    return JSONResponse(
        status_code=exc.status_code,
        content={"detail": exc.detail},
        headers=headers,
    )
```
The three standard headers to include are X-RateLimit-Limit (total allowed requests in the window), X-RateLimit-Remaining (how many the client has left), and X-RateLimit-Reset (seconds until the window resets). The Retry-After header should only appear on 429 responses -- the 429 status code itself was defined in RFC 6585 specifically for rate limiting scenarios. While these X-RateLimit-* headers are widely used by major APIs (GitHub, Twitter/X, Stripe), they are not standardized -- which brings us to an emerging effort to change that.
Toward Standardized Headers: The IETF RateLimit Draft
The X-RateLimit-* headers used throughout this article are a de facto convention, but they have never been formally standardized. Every API that uses them defines slightly different semantics -- some express reset time as a Unix timestamp, others as seconds remaining; some count remaining requests differently depending on the algorithm.
The IETF httpapi working group has been developing a formal standard (draft-ietf-httpapi-ratelimit-headers) that defines two structured header fields to replace the ad-hoc convention. Authored by Roberto Polli (Italian Government Digital Team), Alejandro Martinez (Red Hat), and Darrel Miller (Microsoft), the draft specifies RateLimit-Policy to advertise the server's quota policies and RateLimit to communicate the current state of the client's usage against those policies. The draft uses HTTP structured fields syntax with parameters like q (quota), w (window in seconds), r (remaining), and t (reset time).
A response using the draft format would look like this:
```http
# The policy: 100 requests per 60-second window
RateLimit-Policy: "default";q=100;w=60

# Current state: 67 remaining, resets in 45 seconds
RateLimit: "default";r=67;t=45
```
The draft also supports multiple policies (burst limits and daily limits in the same response), quota units beyond simple request counts (content-bytes, concurrent-requests), and partition keys that help clients understand which dimension they are being limited on. As of March 2026, the draft is at version 10 (published September 27, 2025, with an expiry date of March 31, 2026) and carries an intended status of Standards Track. It is not yet an RFC, but the Standards Track designation signals that the working group considers it on the path toward becoming one.
For now, the pragmatic approach is to continue using X-RateLimit-* headers for compatibility with existing clients, and to follow the draft's design principles (clear semantics, seconds-based reset, per-policy identification) so that migration is straightforward when the standard is finalized.
Existing Libraries: slowapi and fastapi-limiter
Building a custom rate limiter gives you full control, but there are cases where an established library is the better choice. Two stand out for FastAPI.
| Library | Algorithm | Storage | Configuration Style | Best For |
|---|---|---|---|---|
| slowapi | Multiple (via limits library) | In-memory, Redis, Memcached | Decorator with string syntax: "5/minute" | Feature-rich rate limiting adapted from flask-limiter, production-tested at scale |
| fastapi-limiter | Leaky bucket (via pyrate-limiter) | In-memory, Redis | Depends() with Rate/Duration objects | Lightweight, native FastAPI dependency pattern, WebSocket support |
| Custom (this article) | Your choice | Your choice | Full control | Specific business logic, tiered limits, no external dependencies |
slowapi (version 0.1.9) uses a familiar string-based syntax for limits like "5/minute" or "100/hour" and supports storage backends through the limits library. Adapted from flask-limiter, it describes itself as being "used in various production setups, handling millions of requests per month." The trade-off is that it introduces several transitive dependencies, and the project has not published a new PyPI release in over 12 months. Despite being functionally stable, the lack of recent releases means it may lag behind changes in its upstream dependencies.
fastapi-limiter (version 0.2.0, released February 6, 2026) takes a more FastAPI-native approach using Depends() with Rate and Duration objects from pyrate-limiter. It supports middleware-based application, has built-in WebSocket rate limiting via WebSocketRateLimiter, and offers a skip_limiter decorator for exempting specific routes. It is lighter than slowapi but requires a Redis connection for its default storage backend.
Build custom when you need tiered limits based on subscription plans, custom identification logic beyond IP or API key, integration with your existing auth system, or when you want zero external dependencies. The sliding window counter in this article is intentionally simple enough to understand, extend, and debug -- qualities that matter when rate limiting becomes a critical part of your security posture.
When Application-Level Rate Limiting Is Not Enough
Everything built in this article operates at the application layer -- inside your FastAPI process. This is the right place for per-user quotas, per-endpoint limits, and business-logic-aware throttling. But application-level rate limiting has inherent blind spots that you need to account for in a production security architecture.
The core limitation is that by the time a request reaches your FastAPI middleware, it has already consumed network bandwidth, passed through your load balancer, established a TCP connection, and completed a TLS handshake. A volumetric DDoS attack does not need to reach your application code to bring your service down -- it can saturate your network link or exhaust your connection pool before your rate limiter ever sees a request.
In a production environment, you want rate limiting at three layers. Edge-level protection (Cloudflare, AWS Shield, or your cloud provider's DDoS mitigation) absorbs volumetric attacks before they reach your network. Load balancer or reverse proxy limits (Nginx's limit_req, Envoy's rate limit filter) enforce connection and request-per-second limits per IP before traffic reaches your application processes. Application-level limits (the FastAPI middleware and dependencies in this article) enforce per-user, per-endpoint, and business-logic-aware quotas.
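As an illustration of that middle layer, a minimal Nginx `limit_req` sketch might look like the following (the zone name, rates, and upstream name are arbitrary examples, not values from this article):

```nginx
# Shared-memory zone keyed by client IP: 10 MB of state,
# allowing an average of 10 requests/second per address.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    listen 443 ssl;

    location /api/ {
        # Permit short bursts of up to 20 queued requests, served
        # without added delay; anything beyond that is rejected.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;  # return 429 instead of the default 503
        proxy_pass http://fastapi_upstream;
    }
}
```

Note that this layer counts raw requests per IP and knows nothing about API keys or tiers -- which is precisely why the application-level limits in this article still matter behind it.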
Each layer handles the threats that the layers before it cannot see. The edge has no knowledge of your user model. Your load balancer has no visibility into API keys or subscription tiers. Your application code is the only place that can enforce "free-tier users get 100 requests per hour and premium users get 10,000." These layers complement rather than duplicate each other.
Key Takeaways
- Treat rate limiting as a security control, not just a performance feature: Rate limiting maps directly to OWASP API Security API4:2023 (Unrestricted Resource Consumption). Design your limits based on the threat model for each endpoint -- what would an attacker gain from unlimited access to this route?
- Layer middleware and dependencies for defense in depth: Use `BaseHTTPMiddleware` for a global rate limit that catches every request, and `Depends()` for endpoint-specific limits. The middleware is your safety net; dependencies are your fine-grained control. Set per-route limits proportional to the endpoint's cost and sensitivity.
- Choose client identification carefully: API key identification is reliable but only works for authenticated endpoints. IP-based identification is universal but unreliable behind NATs and trivially spoofed via `X-Forwarded-For`. Use the most specific identifier available, and never trust proxy headers from untrusted sources.
- Always exclude health checks and documentation paths: Load balancers and monitoring systems call health endpoints constantly. Rate limiting them creates false alerts. Documentation paths like `/docs` and `/openapi.json` serve FastAPI's Swagger UI and should not count against any limit.
- Include rate limit headers on every response: Add `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` to successful responses. Add `Retry-After` to 429 responses. This helps clients self-regulate and reduces retry storms. Watch the IETF `RateLimit-Policy` and `RateLimit` header draft for the future standard.
- Know the limits of application-level protection: Application-level rate limiting cannot stop distributed denial-of-service attacks or network-layer floods. Pair it with edge-level protection (Cloudflare, AWS Shield) and load-balancer-level connection limits for a complete defense-in-depth posture.
- Consider existing libraries for production: `slowapi` and `fastapi-limiter` handle edge cases like Redis failover, storage abstraction, and WebSocket support. Build custom when you need business-specific logic like tiered pricing limits or integration with your auth system.
Rate limiting is the first line of defense between your API and the outside world, but it is not the last. A misconfigured client, a retry loop without backoff, or a deliberate attack can all send traffic volumes that overwhelm your service. FastAPI's middleware and dependency systems give you the building blocks to enforce limits at exactly the right granularity -- globally for baseline protection and per-route for precise control. Pair those with edge-level and infrastructure-level protections, and you have a system where each layer handles what the others cannot, keeping your API fast and available for the clients that play by the rules.
Sources and Further Reading
- OWASP API Security Top 10 -- API4:2023 Unrestricted Resource Consumption
- Cloudflare Engineering -- How We Built Rate Limiting Capable of Scaling to Millions of Domains
- IETF draft-ietf-httpapi-ratelimit-headers-10 -- RateLimit Header Fields for HTTP
- RFC 6585 -- Additional HTTP Status Codes (Section 4: 429 Too Many Requests)
- Starlette Middleware Documentation -- BaseHTTPMiddleware and Pure ASGI
- FastAPI Middleware Documentation
- slowapi 0.1.9 on PyPI -- Rate Limiting for Starlette and FastAPI
- fastapi-limiter 0.2.0 on PyPI -- Rate Limiting with pyrate-limiter
- Uvicorn Settings Reference -- limit-concurrency, timeout-keep-alive