How to Stream Responses from Async APIs in Python with httpx

A standard await client.get() call waits for the entire response to arrive before giving you any data. That works fine for a 2KB JSON payload. It does not work for a 500MB file download, a newline-delimited JSON feed, or the token-by-token output of a large language model. For these use cases, you need streaming -- reading the response incrementally as the server sends it, without loading the entire body into memory at once.

This article covers every streaming method httpx provides, from raw byte chunks to parsed Server-Sent Events, with working code for each use case.

When You Need Streaming vs a Regular Response

A regular (non-streaming) response works by downloading the entire response body into memory before returning control to your code. For small JSON payloads from REST APIs, this is perfectly fine. But there are three situations where streaming is the better approach.

The first is large file downloads. If you are pulling a 200MB dataset or a binary file from an API, loading the entire thing into memory before writing it to disk wastes RAM and delays the start of processing. Streaming lets you write each chunk to disk as it arrives.

The second is real-time event feeds. APIs that use Server-Sent Events (SSE) or newline-delimited JSON (NDJSON) push data continuously over a long-lived connection. You need to process each event as it arrives, not wait for the stream to "finish" (which may never happen).

The third is LLM APIs. Services from providers like OpenAI and Anthropic stream tokens one at a time via SSE so users see output appearing word by word. Without streaming, the user stares at a blank screen until the entire response is generated.

The Basics: client.stream and Async Iterators

In httpx, you access streaming responses through the client.stream() context manager. Unlike client.get(), which returns a fully-loaded Response object, client.stream() opens a connection and gives you a response that you can iterate over incrementally.

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", "https://httpbin.org/stream/5") as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                print(line)

asyncio.run(main())

The async with client.stream() block opens the connection and reads the headers, but does not download the body. Inside the block, you choose how to read the body: as raw bytes, as decoded text, or as individual lines. When the async with block exits, the response is closed and the underlying connection is released back to the pool (or closed, if the body was not fully read).

Warning

You must read the response body inside the async with block. Once the block exits, the connection is closed, and attempting to iterate the response afterwards raises httpx.StreamClosed.

Streaming Bytes: Downloading Large Files

The aiter_bytes() method yields raw byte chunks as they arrive from the server. This is the right method for binary content like images, PDFs, ZIP archives, or any large file you want to write to disk without holding the entire thing in memory.

import asyncio
import httpx

async def download_file(url, output_path):
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            total = int(response.headers.get("content-length", 0))
            downloaded = 0

            with open(output_path, "wb") as f:
                async for chunk in response.aiter_bytes(chunk_size=8192):
                    f.write(chunk)
                    downloaded += len(chunk)
                    if total:
                        pct = (downloaded / total) * 100
                        print(f"\rProgress: {pct:.1f}%", end="", flush=True)

    print(f"\nSaved to {output_path}")

asyncio.run(download_file(
    "https://speed.hetzner.de/100MB.bin",
    "testfile.bin"
))

The chunk_size=8192 parameter sets the maximum number of bytes each iteration yields. Larger chunks mean fewer iterations and slightly less overhead; smaller chunks mean more responsive progress updates. 8KB is a common default.

Pro Tip

For production file downloads, consider using aiofiles for non-blocking disk writes. The synchronous open() call in the example above blocks the event loop briefly during each f.write(). For small chunks this is negligible, but for extremely high-throughput scenarios, async file I/O prevents the event loop from stalling.

Streaming Lines: Newline-Delimited JSON

Some APIs return data as newline-delimited JSON (NDJSON) -- one JSON object per line, streamed continuously. This format is common in logging APIs, data export endpoints, and real-time feeds. The aiter_lines() method splits the incoming data on newline boundaries and yields each line as a string.

import asyncio
import json
import httpx

async def process_ndjson_stream(url):
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                if not line.strip():
                    continue  # Skip empty lines
                record = json.loads(line)
                print(f"Received: {record}")

asyncio.run(process_ndjson_stream("https://httpbin.org/stream/10"))

Each line is yielded as soon as the newline character is received. You do not have to wait for the entire response to finish. This makes aiter_lines() ideal for processing data feeds where records arrive continuously over a long-lived connection.

Streaming Server-Sent Events with httpx-sse

Server-Sent Events (SSE) use a specific text-based protocol where each event has fields like event:, data:, id:, and retry:. While you could parse SSE manually using aiter_lines(), the httpx-sse library does it correctly and handles edge cases like multi-line data fields and reconnection hints.

# pip install httpx-sse
import asyncio
import httpx
from httpx_sse import aconnect_sse

async def consume_sse(url):
    async with httpx.AsyncClient() as client:
        async with aconnect_sse(client, "GET", url) as event_source:
            event_source.response.raise_for_status()
            async for sse in event_source.aiter_sse():
                print(f"Event: {sse.event}")
                print(f"Data:  {sse.data}")
                print(f"ID:    {sse.id}")
                print("---")

asyncio.run(consume_sse("https://example.com/events"))

The aconnect_sse function wraps an httpx streaming response and yields parsed ServerSentEvent objects. Each object has .event (the event type), .data (the payload, typically JSON), .id (the event ID for reconnection), and .retry (the server's suggested reconnection delay in milliseconds). This saves you from writing the SSE parsing logic yourself and handles the protocol correctly.

Real-World Pattern: Streaming LLM API Output

LLM APIs like OpenAI's chat completions endpoint use SSE to stream tokens as they are generated. The client receives each token individually, allowing the user interface to display text word by word instead of waiting for the entire response. Here is what the client-side consumption looks like using httpx and httpx-sse.

import asyncio
import json
import httpx
from httpx_sse import aconnect_sse

async def stream_chat_completion(api_key, prompt):
    url = "https://api.openai.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

    full_response = []

    async with httpx.AsyncClient() as client:
        async with aconnect_sse(
            client, "POST", url, json=payload, headers=headers
        ) as event_source:
            event_source.response.raise_for_status()
            async for sse in event_source.aiter_sse():
                if sse.data == "[DONE]":
                    break

                chunk = json.loads(sse.data)
                delta = chunk["choices"][0].get("delta", {})
                content = delta.get("content", "")

                if content:
                    print(content, end="", flush=True)
                    full_response.append(content)

    print()  # Newline after streaming finishes
    return "".join(full_response)

Each SSE event contains a JSON object with a choices[0].delta.content field holding one or more tokens. The special [DONE] data field signals the end of the stream. By printing each content fragment with end="", the output appears character by character in the terminal, mimicking the word-by-word rendering you see in chat interfaces.

Note

The httpx-sse library does not include automatic reconnection. If the connection drops mid-stream, you need to handle reconnection yourself using the sse.id and sse.retry fields from the last received event. For production LLM integrations, consider wrapping the stream in retry logic with the last event ID as a cursor.

The Streaming Methods at a Glance

Method                  | Yields                              | Best For
------------------------|-------------------------------------|---------------------------------------------------
response.aiter_bytes()  | Raw byte chunks                     | File downloads, binary content, disk streaming
response.aiter_text()   | Decoded text chunks                 | Large text responses, HTML pages, CSV data
response.aiter_lines()  | Individual lines of text            | NDJSON feeds, log streams, line-oriented protocols
response.aiter_raw()    | Raw bytes without content decoding  | Proxying responses, custom decompression
response.aread()        | Entire body at once (async)         | Conditional reads inside a stream block

All of these methods are available on the response object inside a client.stream() context manager. Outside of a stream context, only response.content (bytes) and response.text (string) are available, and the full body has already been loaded into memory.

Key Takeaways

  1. Use client.stream() when responses are large or continuous: Regular responses load the entire body into memory. Streaming reads data incrementally, keeping memory usage constant regardless of response size.
  2. Choose the right iterator for your data format: aiter_bytes() for binary files, aiter_lines() for line-delimited text, aiter_text() for decoded text chunks, and aiter_raw() when you need bytes without content decoding.
  3. Use httpx-sse for Server-Sent Events: SSE parsing has subtle rules around multi-line data fields, event types, and reconnection hints. The httpx-sse library handles all of this and gives you clean ServerSentEvent objects to work with.
  4. LLM APIs stream tokens via SSE: Modern AI APIs send completions token by token using the text/event-stream content type. Consuming this stream with httpx-sse lets you display text progressively instead of waiting for the full response.
  5. Always read inside the stream context: The connection is closed when the async with client.stream() block exits. Read and process all data inside the block, not after it.

Streaming is the bridge between simple API calls and production-grade data processing. Once you understand the pattern -- open a stream, iterate over it, process each chunk -- you can handle file downloads, real-time feeds, and LLM output with the same underlying technique.
