The difference between a GPU that serves one user and a GPU that serves a thousand is not hardware — it's how you schedule the work.
Python didn't just win the race to train AI models. It won the race to serve them. vLLM, the open-source inference engine that powers a growing share of production LLM deployments, is written in Python. Hugging Face's Text Generation Inference (TGI) is Python. NVIDIA's Triton Inference Server exposes a Python API. FastAPI, the framework that jumped from 29% to 38% adoption in the JetBrains Python Developers Survey, is the default way to wrap a model in an HTTP endpoint.
But spinning up a model behind an API is the easy part. The hard part is what happens when a thousand users hit that endpoint at the same time. Without batching — the technique of grouping multiple inference requests together so the GPU processes them simultaneously — you are paying for a GPU that sits mostly idle while it handles one request at a time. With the right batching strategy, the same hardware can serve orders of magnitude more users at comparable latency.
This article explains every major batching strategy used in production LLM serving today, from the naive approach that wastes your GPU to the continuous batching systems that power real-world deployments. Real code, real systems, real engineering tradeoffs. No hand-waving.
Why Batching Matters: The Memory Bandwidth Bottleneck
To understand why batching is so critical, you need to understand what makes LLM inference slow. It's not computation — it's memory bandwidth.
An LLM generates tokens one at a time in an autoregressive loop. Each token requires a forward pass through the entire model. For an 8-billion-parameter model stored in BF16 precision, that means loading 16 GB of weight data from GPU high-bandwidth memory (HBM) into the compute units for every single token generated. On an NVIDIA A100 with 2 TB/s of memory bandwidth, just streaming the weights takes roughly 8 milliseconds — regardless of whether you're generating one token or a hundred tokens for a hundred different users.
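That 8-millisecond figure is just arithmetic on the numbers above, which a few lines make explicit:

```python
# Lower bound on decode latency from weight streaming alone.
# Numbers from the text: 8B parameters in BF16, A100-class HBM at ~2 TB/s.
params = 8e9
bytes_per_param = 2                      # BF16 = 2 bytes per weight
weight_bytes = params * bytes_per_param  # 16 GB of weights
bandwidth_bytes_per_s = 2e12             # ~2 TB/s

floor_ms = weight_bytes / bandwidth_bytes_per_s * 1000
print(f"Weight-streaming floor: {floor_ms:.0f} ms per forward pass")
# One forward pass produces one token per request. The floor is the
# same whether the batch holds 1 request or 100.
```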
This is the key insight. As Woosuk Kwon and colleagues wrote in their SOSP 2023 paper introducing PagedAttention, the compute utilization in serving LLMs can be improved by batching multiple requests, because the requests share the same model weights and the overhead of moving those weights is amortized across the batch.
When you process a single request, the GPU loads the weight matrix, multiplies it by one activation vector, and moves on. That's a matrix-vector multiplication with an arithmetic intensity of roughly 1 FLOP per byte — far below what modern GPUs can sustain. When you batch 32 requests together, the GPU loads the same weight matrix once but multiplies it by 32 activation vectors. The arithmetic intensity jumps to 32 FLOPs per byte, pushing the operation from memory-bound toward compute-bound territory. Same weight transfer, 32 times the useful work.
import torch
import time

# Simulate the difference between unbatched and batched inference
weight = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

# Single request: matrix-vector multiply
single_input = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)
# Batched: matrix-matrix multiply (32 requests)
batched_input = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

# Warm up
for _ in range(10):
    _ = single_input @ weight.T
    _ = batched_input @ weight.T
torch.cuda.synchronize()

# Benchmark single
start = time.perf_counter()
for _ in range(1000):
    _ = single_input @ weight.T
torch.cuda.synchronize()
single_time = time.perf_counter() - start

# Benchmark batched
start = time.perf_counter()
for _ in range(1000):
    _ = batched_input @ weight.T
torch.cuda.synchronize()
batch_time = time.perf_counter() - start

print(f"Single request: {single_time:.3f}s for 1000 iterations")
print(f"Batched (32): {batch_time:.3f}s for 1000 iterations")
print(f"Per-request time: {batch_time/32:.3f}s vs {single_time:.3f}s")
print(f"Throughput gain: {(single_time * 32) / batch_time:.1f}x")
Run this on a GPU and the throughput gain will be dramatic. The batched version won't take 32 times longer — it will take only marginally longer than the single-request version, because the weight-loading cost is amortized across all 32 requests.
Strategy 1: No Batching (The Baseline)
The simplest serving setup processes one request at a time. If you wrap a model in a FastAPI endpoint without any batching logic, this is what you get:
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}
This works for development and testing. In production, it's a disaster. While the GPU is generating tokens for one user, every other request sits in a queue. The GPU's compute units are almost entirely idle during the memory-bound decode phase, waiting for weight data to stream from HBM. You are paying for a high-end GPU and using a fraction of its capacity.
Strategy 2: Static Batching
Static batching is the simplest improvement over no batching. You collect requests into fixed-size groups and process them together. Most serving frameworks support it as a basic mode: early versions of TensorFlow Serving worked this way, and NVIDIA's Triton Inference Server offers it before its dynamic batcher is enabled.
The mechanism is straightforward: configure a batch size (say, 16) and a timeout window (say, 100 milliseconds). When the first request arrives, the server either waits to collect 15 more requests within the timeout, or runs a partial batch once the timer expires.
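That collect-or-timeout loop can be sketched with nothing but the standard library (the request objects are left abstract; this is an illustration of the mechanism, not any particular server's implementation):

```python
import time
from queue import Queue, Empty

def collect_batch(request_queue: Queue, batch_size: int = 16,
                  timeout_ms: float = 100.0):
    """Block until `batch_size` requests arrive or the timeout expires,
    then return whatever was collected (a possibly partial batch)."""
    batch = [request_queue.get()]  # wait for the first request
    deadline = time.monotonic() + timeout_ms / 1000
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch

q = Queue()
for p in ["a", "b", "c"]:
    q.put(p)
print(collect_batch(q, batch_size=16, timeout_ms=10))  # partial batch
```

The timer starts at the first arrival, so a lone request waits at most one timeout window before running as a batch of one.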
The problem with static batching for LLMs is fundamental: different requests generate different numbers of tokens. If one request in a batch of 16 generates 10 tokens and another generates 500, the short request finishes early but must wait for the entire batch to complete before returning a response to the client. The GPU pads the shorter sequences with empty computation, wasting cycles. The long sequence holds every other request hostage.
As the Orca paper (Yu et al., OSDI 2022) from Seoul National University and FriendliAI observed, existing inference serving systems do not perform well on workloads with a multi-iteration characteristic, due to their inflexible scheduling mechanism that cannot change the current batch of requests being processed. Requests that finish earlier than others in a batch cannot return immediately, while newly arrived requests must wait for the current batch to complete.
For non-LLM models like image classifiers or embedding models, where every request takes roughly the same amount of time, static batching works well. For autoregressive generation with variable output lengths, it's a poor fit.
Strategy 3: Dynamic Batching
Dynamic batching improves on static batching by allowing the batch size and composition to vary based on incoming traffic. Instead of a fixed batch size, the server assembles each batch dynamically, up to a maximum size, based on what's available in the queue.
import asyncio
from collections import deque

import torch


class DynamicBatcher:
    def __init__(self, model, tokenizer, max_batch_size=16,
                 max_wait_ms=50):
        self.model = model
        self.tokenizer = tokenizer  # pad_token must be set (e.g. to eos)
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = deque()
        self.processing = False

    async def add_request(self, prompt: str, max_tokens: int = 128):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((prompt, max_tokens, future))
        if not self.processing:
            self.processing = True
            asyncio.create_task(self._process_batches())
        return await future

    async def _process_batches(self):
        while self.queue:
            # Wait briefly to accumulate requests
            await asyncio.sleep(self.max_wait_ms / 1000)
            # Collect up to max_batch_size requests
            batch = []
            while self.queue and len(batch) < self.max_batch_size:
                batch.append(self.queue.popleft())
            # Process the batch
            prompts = [item[0] for item in batch]
            max_tokens = max(item[1] for item in batch)
            inputs = self.tokenizer(
                prompts, return_tensors="pt",
                padding=True, truncation=True
            ).to(self.model.device)
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs, max_new_tokens=max_tokens
                )
            # Return results
            for i, (_, _, future) in enumerate(batch):
                text = self.tokenizer.decode(
                    outputs[i], skip_special_tokens=True
                )
                future.set_result(text)
        self.processing = False
Dynamic batching is suitable for models like Stable Diffusion, image classifiers, and embedding models — workloads where each request takes approximately the same time. For LLMs, it still suffers from the variable-length problem: the entire batch is held until the longest sequence finishes generating.
Strategy 4: Continuous Batching (Iteration-Level Scheduling)
Continuous batching is the breakthrough that made large-scale LLM serving practical. The idea was introduced in the Orca paper (Yu et al., OSDI 2022) from Seoul National University and FriendliAI, and it fundamentally changes how the scheduler interacts with the execution engine.
Instead of scheduling at the request level (run this batch of requests to completion), continuous batching schedules at the iteration level — meaning the scheduler makes a decision at every single forward pass of the model. After each token is generated for each request in the batch, the scheduler can remove any request that has finished (hit the stop token or max length), add new requests from the queue into the freed slots, and proceed with the next forward pass on the updated batch.
This eliminates the problem of short requests waiting for long ones. The moment a request finishes, its slot is immediately recycled. New requests start generating without waiting for an entire batch to complete.
The Orca paper proposed two key techniques: iteration-level scheduling, described above, and selective batching, which handles the fact that requests in different phases (initial prompt processing vs. token generation) require different amounts of computation. Together, these techniques form the foundation that every modern LLM serving system builds on.
Here's a simplified illustration of the scheduling loop:
class ContinuousBatchScheduler:
    """Simplified continuous batching scheduler.

    Real implementations (vLLM, TGI) handle KV cache management,
    preemption, chunked prefill, and memory pressure -- this
    illustrates the core iteration-level scheduling concept.
    """

    def __init__(self, model, max_batch_size=64):
        self.model = model
        self.max_batch_size = max_batch_size
        self.running = {}  # request_id -> request state
        self.waiting = []  # queue of pending requests

    def step(self):
        """Execute one iteration of the model on the current batch."""
        # 1. Remove completed requests
        finished = [
            rid for rid, req in self.running.items()
            if req.is_done()
        ]
        for rid in finished:
            self.running[rid].return_result()
            del self.running[rid]
        # 2. Admit new requests into freed slots
        while (self.waiting
               and len(self.running) < self.max_batch_size):
            new_req = self.waiting.pop(0)
            self.running[new_req.id] = new_req
        if not self.running:
            return
        # 3. Run one forward pass for all active requests
        batch_inputs = self._prepare_batch()
        next_tokens = self.model.forward(batch_inputs)
        # 4. Append generated tokens to each request
        for rid, token in zip(self.running.keys(), next_tokens):
            self.running[rid].append_token(token)

    def serve(self):
        """Main serving loop -- runs step() continuously."""
        while True:
            self.step()
The critical difference from dynamic batching is visible in the step() method: at every iteration, the scheduler inspects all running requests, evicts finished ones, and admits new ones. No request waits for an unrelated request to finish. The GPU stays full.
FriendliAI, the company behind the Orca paper, holds patents on iteration-level scheduling (which they call "iteration batching") in the US and Korea. The technique has since been adopted by virtually every major LLM inference engine: vLLM, TGI, TensorRT-LLM, SGLang, and DeepSpeed-FastGen all implement variants of continuous batching.
PagedAttention: The Memory Innovation That Unlocked Continuous Batching
Continuous batching has a major practical problem: the KV cache. Each request in the batch needs to store its key-value cache — the intermediate attention states accumulated during generation — in GPU memory. For a 13B-parameter model, the KV cache for a single request with a 2048-token context can consume well over a gigabyte.
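The scale of the problem falls out of the architecture. A back-of-envelope sketch, using approximate 13B-class dimensions (40 layers, 40 KV heads, head dimension 128) rather than any exact published config:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, dtype_bytes=2):
    """KV cache size for one request: a key and a value vector per
    attention head, per layer, per token, at dtype_bytes precision."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len

# Approximate 13B-class dimensions (not an exact published config)
size = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128,
                      seq_len=2048)
print(f"{size / 1e9:.2f} GB for a single 2048-token request")
```

At roughly 0.8 MB per token, a few dozen concurrent long requests exhaust even an 80 GB GPU, which is why cache allocation, not compute, limits the batch size.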
Before vLLM, inference systems pre-allocated a contiguous block of GPU memory for each request's maximum possible context length. This caused massive waste through three types of memory fragmentation: internal fragmentation (allocated slots never used because the sequence ends early), external fragmentation (gaps between allocated blocks too small to use), and reservation waste (reserving for the maximum context even when the actual sequence is short).
The vLLM paper (Kwon et al., SOSP 2023) measured that prior systems wasted 60% to 80% of their KV cache memory to fragmentation. The solution was PagedAttention, which Kwon and colleagues described in the paper as an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. The analogy is precise: KV cache blocks are pages, tokens are bytes, and requests are processes. A block table maps each request's logical blocks to physical blocks scattered anywhere in GPU memory.
In the vLLM blog post announcing the project, Kwon wrote that because the blocks do not need to be contiguous in memory, the keys and values can be managed in a more flexible way, much like virtual memory in an OS — with blocks as pages, tokens as bytes, and sequences as processes.
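A toy version of the block table makes the mapping concrete (a sketch of the analogy only; the block size, pool, and method names here are invented, not vLLM internals):

```python
BLOCK_SIZE = 16  # tokens of KV cache per physical block (illustrative)

class BlockTable:
    """Maps a request's logical block indices to physical block IDs."""

    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # pool of physical block IDs
        self.table = []                # logical index -> physical ID

    def append_token(self, position):
        # A new physical block is claimed once every BLOCK_SIZE tokens;
        # it can live anywhere in memory, like a page frame.
        if position % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        return self.table[position // BLOCK_SIZE]

    def release(self):
        # On completion, physical blocks return to the shared pool.
        self.free.extend(self.table)
        self.table.clear()

pool = list(range(100))   # 100 physical blocks of "GPU memory"
req = BlockTable(pool)
for pos in range(40):     # 40 tokens -> ceil(40 / 16) = 3 blocks
    req.append_token(pos)
print(len(req.table))     # 3
```

Because allocation happens one block at a time as tokens are generated, a request never reserves more than one partially filled block, which is where the sub-4% waste figure comes from.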
The result: KV cache memory waste dropped to under 4%, allowing the system to fit substantially more requests into each batch. Larger batches mean higher throughput. The SOSP 2023 evaluations showed vLLM improving throughput by 2 to 4 times over state-of-the-art systems like FasterTransformer and Orca at the same latency level.
from vllm import LLM, SamplingParams

# vLLM handles continuous batching and PagedAttention
# automatically -- no manual batch management needed
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=4096,
    gpu_memory_utilization=0.90,  # Let the engine use 90% of GPU memory
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# vLLM batches these automatically using continuous batching
prompts = [
    "Explain the concept of attention in transformers.",
    "Write a Python function to merge two sorted lists.",
    "What are the tradeoffs between LSTM and Transformer architectures?",
    "Summarize the key ideas in the PagedAttention paper.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"Prompt: {prompt[:60]}...")
    print(f"Output: {generated[:120]}...")
    print()
Under the hood, vLLM's scheduler runs the continuous batching loop described above, with PagedAttention managing the KV cache. The user never has to think about batching — they submit prompts and the engine maximizes throughput automatically.
Chunked Prefill: Mixing Prompt Processing and Generation
There's one more subtlety that production systems handle. LLM inference has two distinct phases: the prefill phase (processing the entire input prompt in a single forward pass to populate the KV cache) and the decode phase (generating tokens one at a time). These phases have very different computational profiles — prefill is compute-heavy, while decode is memory-bandwidth-heavy.
If a long prompt arrives while many decode-phase requests are running, processing the entire prompt in one shot can create a latency spike for all the decode requests sharing the GPU. Chunked prefill, introduced in systems like Sarathi-Serve and adopted by vLLM, solves this by splitting long prompts into smaller chunks and interleaving them with decode steps. This keeps the per-step latency predictable while still making progress on new prompt processing.
from vllm import LLM

# vLLM supports chunked prefill via engine configuration
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # Max tokens per step (prefill + decode)
)
The max_num_batched_tokens parameter controls the total token budget per step. The scheduler allocates this budget across both prefill chunks and decode tokens, balancing throughput against latency stability.
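The budgeting idea itself fits in a few lines. A sketch under simplifying assumptions (each running decode request costs one token per step, and one waiting prompt is chunked at a time; this is an illustration, not vLLM's actual scheduler):

```python
def plan_step(decode_requests, pending_prompt_tokens, budget=2048):
    """Split one step's token budget between decodes and a prefill chunk.

    decode_requests: number of running requests (1 token each this step)
    pending_prompt_tokens: unprocessed tokens of the next waiting prompt
    Returns (decode_tokens, prefill_chunk) for this forward pass.
    """
    decode_tokens = min(decode_requests, budget)   # decodes go first
    remaining = budget - decode_tokens
    prefill_chunk = min(pending_prompt_tokens, remaining)
    return decode_tokens, prefill_chunk

# 200 running decodes plus an 8000-token prompt, 2048-token budget:
# the prompt is spread across steps instead of stalling the decodes.
print(plan_step(200, 8000))          # (200, 1848)
print(plan_step(200, 8000 - 1848))   # (200, 1848) on the next step
```

Admitting decodes first keeps inter-token latency stable for running requests; the prompt simply takes a few more steps to finish prefilling.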
Speculative Decoding: Predicting Multiple Tokens at Once
Batching isn't the only technique for accelerating inference. Speculative decoding, proposed independently by Leviathan et al. at Google and Chen et al. at DeepMind in 2022–2023, takes a fundamentally different approach: instead of generating one token per forward pass, it generates several at once by using a small, fast "draft" model to propose candidate tokens that a larger "target" model then verifies in parallel.
As Yaniv Leviathan of Google Research explained in a retrospective on the technique, the approach aims to increase concurrency by computing several tokens in parallel, based on the expectation that additional parallel computational resources are available while tokens are computed serially. The original paper demonstrated 2 to 3 times speedups on tasks like translation and summarization. Leviathan noted that Google has since deployed speculative decoding across multiple products, including AI Overviews in Google Search, where it produces results faster while maintaining the same quality of responses.
Speculative decoding is lossless — the verification step guarantees that the output distribution is identical to what the target model would have produced on its own. If the draft model's guess is wrong, only the incorrect tokens are rejected and the target model's corrected token is used instead.
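Under greedy (temperature-0) sampling, the verification rule reduces to prefix matching, which makes the losslessness easy to see in a toy sketch (draft_model and target_model here are stand-in callables, not real models; real systems verify a full probability distribution via rejection sampling):

```python
def speculative_step(target_model, draft_model, context, k=5):
    """One speculative decoding step under greedy sampling.

    The draft proposes k tokens; the target checks them (in a real
    system, all k positions in one parallel pass); we keep the longest
    prefix the target agrees with, plus the target's own token at the
    first disagreement.
    """
    # Draft proposes k tokens autoregressively (the cheap model).
    draft_tokens = []
    ctx = list(context)
    for _ in range(k):
        t = draft_model(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    # Target verifies the proposals; output matches pure target decoding.
    accepted = []
    ctx = list(context)
    for t in draft_tokens:
        target_choice = target_model(ctx)
        if target_choice == t:
            accepted.append(t)              # draft agreed with target
            ctx.append(t)
        else:
            accepted.append(target_choice)  # correct token, then stop
            break
    return accepted

# Toy "models": the target echoes a fixed sequence; the draft gets
# the third token wrong.
SEQ = [1, 2, 3, 4, 5, 6]
target = lambda ctx: SEQ[len(ctx)]
draft = lambda ctx: SEQ[len(ctx)] if len(ctx) != 2 else 99
print(speculative_step(target, draft, context=[], k=5))  # [1, 2, 3]
```

One step emitted three tokens for the cost of one target pass plus five cheap draft calls; with a perfect draft it would have emitted all five.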
from vllm import LLM, SamplingParams

# Speculative decoding in vLLM: small model drafts, large model
# verifies. (Argument names follow older vLLM releases; recent
# versions moved these options into a speculative_config dict.)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
output = llm.generate("Explain how PagedAttention works.", params)
print(output[0].outputs[0].text)
Speculative decoding and continuous batching are complementary — they optimize different bottlenecks. Continuous batching maximizes throughput by filling the GPU with many requests. Speculative decoding reduces latency for individual requests by generating multiple tokens per step. Production systems like vLLM support both simultaneously.
Putting It Together: A Production Serving Stack
A realistic production LLM serving deployment in Python uses vLLM as the inference engine behind an OpenAI-compatible API server:
# Start a vLLM server with production settings (shell command)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256

# Client code using the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require an API key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain batching in LLM inference."}
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
Behind this simple API call, vLLM is running continuous batching with PagedAttention, dynamically scheduling requests at the iteration level, managing the KV cache with paged memory allocation, and optionally applying chunked prefill and speculative decoding. The entire stack is Python, from the API layer (FastAPI) through the scheduler and memory manager to the engine orchestration — with only the CUDA kernels and attention computations dropping down to compiled code.
Choosing the Right Strategy
Non-LLM models (image classifiers, embedding models, diffusion models): Dynamic batching works well. Requests take roughly the same time, so the variable-length problem doesn't apply. Triton Inference Server or a custom batching wrapper around your model is sufficient.
LLM serving in production: Use continuous batching. This is not optional for any deployment serving more than a handful of concurrent users. vLLM, TGI, TensorRT-LLM, and SGLang all implement it. Choose based on your ecosystem requirements, hardware, and model support.
Latency-sensitive single-user scenarios: If you are optimizing for the lowest possible time-to-first-token for a single user rather than maximizing throughput across many users, simpler serving setups may suffice. But even here, speculative decoding can cut latency significantly.
Massive offline batch jobs: For processing thousands of prompts without latency constraints, continuous batching engines still win, but you can be more aggressive with batch sizes and disable latency-oriented optimizations like chunked prefill.
Conclusion
The evolution of LLM batching strategies reads like a compressed history of computer science applied to a new problem. Static batching borrowed from classical request scheduling. Continuous batching borrowed iteration-level scheduling from the Orca paper's insight that autoregressive generation demands a fundamentally different scheduling granularity. PagedAttention borrowed virtual memory paging from operating systems. Speculative decoding borrowed speculative execution from CPU pipeline design.
Every one of these innovations was implemented in Python, and every production LLM serving engine exposes a Python interface. The engineer who understands these batching strategies doesn't just know how to call vllm.generate() — they understand why that call is fast, what tradeoffs the engine makes on every forward pass, and how to configure the system for their specific workload.
The GPU is expensive. The model is expensive. The batching strategy is what determines whether you need ten GPUs or one.
No copy-paste tutorials here. Go serve something.