Python Serialization: What It Is, Why It Matters, and Every Way to Do It

Every time your Python program saves data to disk, sends an object across a network, or hands a result off to another process, something important happens behind the scenes: serialization. It is one of the most universal concepts in software, and Python gives you more ways to do it than almost any other language. This article walks through what serialization actually is, why you need it, and every serious approach Python offers — from the standard library to modern high-performance libraries that are reshaping how production systems handle data in 2026 and beyond. Along the way, we will examine the security traps, the cognitive models that help you think about boundaries, and the real-world architectural decisions that separate robust systems from fragile ones.

Serialization is a topic that many developers encounter early and revisit constantly throughout their careers. The basics are easy to learn, but the choices you make — which format, which library, how much validation to enforce at the boundary — have downstream consequences for performance, security, and maintainability. This guide covers the full picture: conceptual foundations followed by working, modern, idiomatic Python code for every major approach.

What Serialization Is

Serialization is the process of converting a live, in-memory object into a format that can be stored or transmitted, and then reconstructed later. The reverse operation — taking that stored format and rebuilding the original object — is called deserialization (also called "loading," "decoding," or "unpickling" depending on the tool).

Consider a Python dictionary like this:

user = {
    "id": 42,
    "username": "kandi",
    "roles": ["admin", "instructor"],
    "active": True,
}

That object exists in your computer's RAM while the program is running. The moment the program exits, it is gone. If you want to save it to a file, send it to a web API, or hand it to another Python process, you need to convert it into something that can travel: a string of text, a stream of bytes, or a structured binary blob. That conversion is serialization. Reading it back and reconstructing the dictionary is deserialization.
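To make the round trip concrete before digging into specific tools, here is that same dictionary surviving a save-and-restore cycle with the standard library:

```python
import json

user = {
    "id": 42,
    "username": "kandi",
    "roles": ["admin", "instructor"],
    "active": True,
}

# Serialization: live object -> portable text
wire = json.dumps(user)

# Deserialization: portable text -> an equivalent live object
restored = json.loads(wire)
print(restored == user)  # True
```

The restored dictionary is equal to the original but is a new object: serialization preserves structure and values, not identity.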

The Hitchhiker's Guide to Python defines serialization as converting structured data into a format that enables sharing or storage while preserving the ability to reconstruct the original structure. — Source: docs.python-guide.org, Data Serialization

Two structural choices come up immediately when designing your serialization approach. The first is the format: text-based formats like JSON and TOML are human-readable and interoperable; binary formats like Protocol Buffers and MessagePack are compact and fast. The second is the schema: do you enforce a strict schema on the data structure, or serialize whatever shape arrives? Schema-enforced serialization catches bugs at the boundary; schemaless serialization is more flexible but lets invalid data slip through.

Terminology

You will see "marshalling," "encoding," "pickling," and "serialization" used interchangeably in different communities. In Python, the terms tend to be format-specific: "pickling" means Python's pickle module, "encoding/decoding" is common for JSON and binary formats, and "serialization/deserialization" (or "serdes") is the general umbrella term.

Why Serialization Is Needed

At its core, serialization exists because programs are temporary but data is not. There are four distinct situations that make it necessary.

Persistence. Saving application state to disk so it survives after the program closes. Configuration files, saved games, user profiles, and cached computation results are all examples. Without serialization, none of those things would outlast the process that created them.

Communication. When two programs talk to each other — across a network, between microservices, through a message queue — they cannot share raw memory. They must agree on a format that both sides can produce and consume. REST APIs exchange JSON. gRPC services use Protocol Buffers. Message queues frequently use MessagePack or Avro. Serialization is the medium of that conversation.

Inter-process communication. Even within a single machine, separate processes cannot share memory directly. Python's multiprocessing module serializes objects automatically using pickle when passing data between workers, making it possible to distribute work across CPU cores.
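This pickling happens invisibly. A minimal way to see it, without spawning any workers, is a multiprocessing Pipe, whose two ends pickle objects on send() and unpickle them on recv():

```python
from multiprocessing import Pipe

# Each end of a Pipe pickles objects on send() and unpickles on recv()
parent_end, child_end = Pipe()

task = {"job_id": 7, "payload": [1, 2, 3]}
parent_end.send(task)        # pickled behind the scenes
received = child_end.recv()  # unpickled into a new, equal object

print(received)              # {'job_id': 7, 'payload': [1, 2, 3]}
print(received is task)      # False - a reconstructed copy, not the same object
```

The same rule explains a common multiprocessing gotcha: anything crossing the process boundary must be picklable, which is why lambdas cannot be used as worker functions.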

Caching. Storing a computed result in Redis, Memcached, or a local file so that an expensive operation does not have to run again requires serialization. The computed Python object must be converted into something that the cache can store and return later.
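As a sketch of the caching pattern, here is a minimal file-backed cache decorator for JSON-serializable results (the cache directory name and hashing scheme are illustrative choices, not a standard recipe):

```python
import hashlib
import json
from functools import wraps
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical cache location


def json_file_cache(func):
    """Cache a function's JSON-serializable return value on disk."""
    @wraps(func)
    def wrapper(*args):
        CACHE_DIR.mkdir(exist_ok=True)
        key = hashlib.sha256(repr((func.__name__, args)).encode()).hexdigest()
        path = CACHE_DIR / f"{key}.json"
        if path.exists():
            # Cache hit: deserialize the stored result
            return json.loads(path.read_text(encoding="utf-8"))
        result = func(*args)
        # Cache miss: serialize and store for next time
        path.write_text(json.dumps(result), encoding="utf-8")
        return result
    return wrapper


@json_file_cache
def expensive_sum(n: int) -> int:
    return sum(i * i for i in range(n))


first = expensive_sum(1000)   # computed, then written to disk
second = expensive_sum(1000)  # read back from the serialized file
print(first == second)        # True
```

The same shape applies to Redis or Memcached: only the storage call changes, the serialize-on-write, deserialize-on-read structure stays.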

Pro Tip

Validate data at the boundaries of your program — at the point where external data enters your code. Inside your own codebase, trust your own objects and skip redundant validation. This is where libraries like Pydantic and msgspec shine: they handle the untrusted edge, not every internal function call.

JSON: The Universal Standard

Python's built-in json module handles the serialization format that dominates the modern web. According to the Python Developer Survey, a large share of Python developers work with JSON regularly, making it the format you are most likely to meet in practice.

The four functions you need to know are json.dumps(), json.dump(), json.loads(), and json.load(). The "s" variants work with strings; the non-"s" variants work with file objects.

import json
from pathlib import Path

# --- Basic encoding ---
data = {
    "id": 1,
    "username": "kandi",
    "scores": [95, 87, 100],
    "active": True,
}

# To a JSON string
json_string = json.dumps(data, indent=2)
print(json_string)

# To a file
Path("user.json").write_text(json.dumps(data, indent=2), encoding="utf-8")

# --- Basic decoding ---
raw = '{"id": 1, "username": "kandi", "scores": [95, 87, 100], "active": true}'
parsed = json.loads(raw)
print(parsed["username"])  # kandi

# From a file
loaded = json.loads(Path("user.json").read_text(encoding="utf-8"))

The standard library's json module only handles a limited set of Python types: dicts, lists, strings, numbers, booleans, and None. Custom types require a custom encoder. The cleanest modern approach uses functools.singledispatch to build an extensible serializer without a giant isinstance chain.

import json
import functools
from datetime import date, datetime
from decimal import Decimal
from uuid import UUID
from enum import Enum


@functools.singledispatch
def to_serializable(val):
    """Fallback: convert unknown types to string."""
    return str(val)


@to_serializable.register
def _(val: datetime) -> str:
    return val.isoformat()


@to_serializable.register
def _(val: date) -> str:
    return val.isoformat()


@to_serializable.register
def _(val: Decimal) -> float:
    return float(val)


@to_serializable.register
def _(val: UUID) -> str:
    return str(val)


@to_serializable.register
def _(val: Enum) -> str:
    return val.value


class ExtendedEncoder(json.JSONEncoder):
    def default(self, obj):
        return to_serializable(obj)


# Usage
payload = {
    "user_id": UUID("12345678-1234-5678-1234-567812345678"),
    "registered": datetime(2024, 6, 1, 9, 0, 0),
    "balance": Decimal("99.95"),
}

print(json.dumps(payload, cls=ExtendedEncoder, indent=2))
Note

Using functools.singledispatch means you can register new type handlers anywhere in your codebase without modifying a central encoder. This pattern, recommended by Hynek Schlawack in his widely-cited piece on Python serialization, is far more maintainable than a single monolithic default() method that knows about every type.

Pickle: Python-Native Binary Serialization

The pickle module is Python's built-in binary serialization format. Unlike JSON, it can serialize almost any Python object: custom class instances, lambda functions (with the dill extension), NumPy arrays, scikit-learn models, and more. The tradeoff is that pickle output is Python-specific and version-sensitive — you cannot load a pickle file in JavaScript, and loading a pickle file generated by a different Python or library version may fail or behave unexpectedly.

import pickle
from pathlib import Path
from dataclasses import dataclass


@dataclass
class ModelMetadata:
    name: str
    version: str
    accuracy: float
    features: list[str]


metadata = ModelMetadata(
    name="fraud_detector",
    version="1.4.2",
    accuracy=0.973,
    features=["amount", "merchant_category", "hour_of_day"],
)

output_path = Path("model_metadata.pkl")

# Serialize with highest available protocol for best performance
with output_path.open("wb") as f:
    pickle.dump(metadata, f, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize
with output_path.open("rb") as f:
    loaded = pickle.load(f)

print(loaded.name)       # fraud_detector
print(loaded.accuracy)   # 0.973
Security Warning

Never unpickle data from an untrusted source. The pickle format can execute arbitrary code during deserialization. This is a well-documented, intentional design characteristic, not a bug — but it means pickle is appropriate only for data you produced yourself. For anything crossing a network boundary or arriving from user input, use JSON, msgspec, or another schema-validated format instead.

When working with machine learning models that contain large NumPy arrays, joblib is often a better choice than raw pickle because it handles array compression and memory-mapped loading more efficiently. In benchmarks, joblib has shown meaningful efficiency gains over pickle for ensemble models with large array payloads. For scikit-learn models specifically, using pickle.HIGHEST_PROTOCOL (protocol 5 in Python 3.8+) can reduce output size relative to older protocols.

import joblib
from pathlib import Path

# joblib is pip-installable: pip install joblib
# Compress at level 3 (range 0-9, higher = smaller but slower)
joblib.dump(metadata, Path("metadata_compressed.joblib"), compress=3)

# Load back
loaded_meta = joblib.load(Path("metadata_compressed.joblib"))
print(loaded_meta.version)  # 1.4.2

Dataclasses and the Standard Library

Python's dataclasses module, available since Python 3.7, provides a clean way to define structured data without boilerplate. Dataclasses do not serialize themselves out of the box, but they pair well with the standard library's dataclasses.asdict() and dataclasses.astuple() functions, which convert instances to plain Python types that json.dumps() can handle directly.

import json
from dataclasses import dataclass, field, asdict
from datetime import datetime


@dataclass
class Address:
    street: str
    city: str
    country: str
    postal_code: str


@dataclass
class User:
    id: int
    username: str
    email: str
    address: Address
    tags: list[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)


user = User(
    id=1,
    username="kandi",
    email="[email protected]",
    address=Address("123 Main St", "Austin", "US", "78701"),
    tags=["instructor", "admin"],
)

# Convert to a serializable dict
user_dict = asdict(user)

# created_at is a datetime, so we need a custom encoder
class DatetimeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)

json_output = json.dumps(user_dict, cls=DatetimeEncoder, indent=2)
print(json_output)

For deserialization back into a dataclass from a dict, there is no built-in helper — you construct the instance manually or use a library. For simple flat dataclasses this is trivial; for nested structures it requires a recursive approach or a third-party library like dacite or cattrs.

import json
from dataclasses import dataclass


@dataclass
class Config:
    host: str
    port: int
    debug: bool


raw_json = '{"host": "localhost", "port": 8080, "debug": false}'
data = json.loads(raw_json)

# Manual reconstruction
config = Config(**data)
print(config.port)  # 8080
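For nested dataclasses, a small recursive helper suffices. This is a simplified sketch: it assumes field annotations are actual classes (no from __future__ import annotations) and ignores generics such as list[Address] and Optional fields, which is where libraries like dacite earn their keep:

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Address:
    street: str
    city: str


@dataclass
class User:
    id: int
    address: Address


def from_dict(cls, data: dict):
    """Recursively build a dataclass instance from a plain dict."""
    kwargs = {}
    for f in fields(cls):
        value = data[f.name]
        # Recurse when the field is itself a dataclass and the input is a dict
        if is_dataclass(f.type) and isinstance(value, dict):
            value = from_dict(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)


user = from_dict(User, {"id": 1, "address": {"street": "1 Oak Ave", "city": "Denver"}})
print(user.address.city)  # Denver
```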
Pro Tip

Benchmarks show plain Python dataclasses are significantly faster than Pydantic for object creation and dict conversion. If you do not need runtime type validation — only structure and serialization — dataclasses are the most performant standard-library option.

attrs and cattrs: The Composable Alternative

One approach conspicuously absent from many serialization guides is attrs paired with cattrs. This combination is worth understanding because it embodies a fundamentally different design philosophy: serialization rules live entirely outside the data model. While Pydantic and msgspec bake serialization behavior into the class definition, cattrs treats structuring and unstructuring as separate, composable operations that you configure independently. This is the principle Hynek Schlawack, creator of attrs, has championed: your data classes should describe structure, not serialization concerns.

# pip install attrs cattrs
from datetime import datetime

from attrs import define, field
from cattrs import unstructure
from cattrs.preconf.orjson import make_converter

@define
class Address:
    street: str
    city: str
    country: str
    postal_code: str

@define
class User:
    id: int
    username: str
    email: str
    address: Address
    tags: list[str] = []  # attrs converts mutable literals like [] to factories
    created_at: datetime = field(factory=datetime.now)  # fresh value per instance

# Basic structuring/unstructuring
user = User(
    id=1,
    username="kandi",
    email="[email protected]",
    address=Address("123 Main St", "Austin", "US", "78701"),
    tags=["instructor", "admin"],
)

# Unstructure to a plain dict (ready for json.dumps)
user_dict = unstructure(user)
print(user_dict)

# Structure back from a dict
raw = {
    "id": 2,
    "username": "alice",
    "email": "[email protected]",
    "address": {"street": "1 Oak Ave", "city": "Denver", "country": "US", "postal_code": "80201"},
    "tags": ["viewer"],
    "created_at": "2026-03-01T09:00:00",
}

# Use a preconfigured orjson converter for speed + datetime handling
converter = make_converter()
converter.register_structure_hook(datetime, lambda v, _: datetime.fromisoformat(v) if isinstance(v, str) else v)
converter.register_unstructure_hook(datetime, lambda v: v.isoformat())

user2 = converter.structure(raw, User)
print(user2.username)  # alice

# Serialize directly to JSON bytes via orjson
json_bytes = converter.dumps(user)
print(json_bytes)

The key advantage is composability. With cattrs, you can register different structuring hooks for different contexts — one converter for your REST API, another for your internal message queue, a third for database serialization — all operating on the same attrs classes without modifying them. The cattrs.preconf package ships with pre-configured converters for orjson, msgpack, PyYAML, tomlkit, and several others, making this a genuinely versatile approach.

The tradeoff is explicitness: cattrs does not validate by default. You get structuring (type coercion) but not rich error messages about why the input was wrong. For validation-heavy boundaries, Pydantic is more ergonomic. For systems where the data model should not be aware of its own serialization format, attrs and cattrs are the cleaner architectural choice.

Pydantic: Validation-First Serialization

Pydantic is the most widely adopted data validation library in the Python ecosystem, and validation is the operative word. Its core value is not raw serialization speed but the guarantees it provides: type coercion, field validators, custom constraints, and detailed error messages when incoming data does not match the expected schema. Pydantic v2 (currently at v2.12.5 stable, with v2.13 in beta), rewritten with a Rust core via pydantic-core, is dramatically faster than v1 while retaining the same developer-friendly API. The v2.12 release added support for Python 3.14, and the upcoming v2.13 release (currently in beta) introduces a new polymorphic_serialization option for cleaner handling of subclassed models.

from datetime import datetime
from typing import Annotated

from pydantic import BaseModel, Field, field_validator


class Address(BaseModel):
    street: str
    city: str
    country: str
    postal_code: str


class User(BaseModel):
    id: int
    username: Annotated[str, Field(min_length=3, max_length=64)]
    email: str
    address: Address
    tags: list[str] = []
    created_at: datetime = Field(default_factory=datetime.now)

    @field_validator("username")
    @classmethod
    def username_must_be_lowercase(cls, v: str) -> str:
        return v.lower()

    @field_validator("tags")
    @classmethod
    def tags_must_be_unique(cls, v: list[str]) -> list[str]:
        return list(dict.fromkeys(v))  # deduplicate, preserve order


# --- Deserialization with validation ---
raw = {
    "id": 1,
    "username": "KANDI",
    "email": "[email protected]",
    "address": {
        "street": "123 Main St",
        "city": "Austin",
        "country": "US",
        "postal_code": "78701",
    },
    "tags": ["instructor", "admin", "instructor"],  # duplicate
}

user = User.model_validate(raw)
print(user.username)       # kandi (lowercased)
print(user.tags)           # ['instructor', 'admin'] (deduplicated)

# --- Serialization ---
# To dict
user_dict = user.model_dump()

# To JSON string
user_json = user.model_dump_json()

# To JSON bytes (model_dump_json() returns a str; encode it for the wire)
user_bytes = user.model_dump_json().encode()

# --- Parsing from JSON directly ---
json_string = '{"id": 2, "username": "alice", "email": "[email protected]", "address": {"street": "1 Oak Ave", "city": "Denver", "country": "US", "postal_code": "80201"}}'
user2 = User.model_validate_json(json_string)
print(user2.id)  # 2

Pydantic is the right choice when you are handling data at a trust boundary: parsing a request body in FastAPI, reading a config file, or processing webhook payloads. Use it where you genuinely need the validation logic. For internal data movement within a trusted codebase, its overhead is unnecessary.

msgspec: The Performance-First Choice in 2026

msgspec is a serialization and validation library that has risen significantly in adoption among performance-focused Python backends. Now at version 0.20.0 with Python 3.14 support (including freethreaded mode), it supports JSON, MessagePack, YAML, and TOML out of the box, and its benchmarks are striking: according to the library's own documentation and independent benchmarks, msgspec decodes and validates JSON faster than orjson alone can decode it — meaning the schema validation comes at zero additional cost relative to a fast parser.

The msgspec documentation reports that encoding and decoding can be 10 to 80 times faster than alternative libraries for supported types. — Source: jcristharif.com/msgspec

The central abstraction is msgspec.Struct, a slot-based class implemented in C that feels familiar if you have used dataclasses but is significantly faster. According to the official benchmarks, Structs are 5 to 60 times faster than comparable alternatives for common operations, including creation, equality checks, and serialization.

# pip install msgspec
import msgspec
from msgspec import Struct


class Address(Struct):
    street: str
    city: str
    country: str
    postal_code: str


class User(Struct):
    id: int
    username: str
    email: str
    address: Address
    tags: list[str] = []
    active: bool = True


# --- Encoding to JSON bytes ---
user = User(
    id=1,
    username="kandi",
    email="[email protected]",
    address=Address("123 Main St", "Austin", "US", "78701"),
    tags=["instructor", "admin"],
)

encoded = msgspec.json.encode(user)
print(encoded)
# b'{"id":1,"username":"kandi","email":"[email protected]","address":{"street":"123 Main St",...},...}'

# --- Decoding from JSON bytes with schema validation ---
raw = b'{"id":2,"username":"alice","email":"[email protected]","address":{"street":"1 Oak","city":"Denver","country":"US","postal_code":"80201"}}'
decoded = msgspec.json.decode(raw, type=User)
print(decoded.username)  # alice
print(type(decoded))     # <class '__main__.User'>

# --- MessagePack for binary transport ---
packed = msgspec.msgpack.encode(user)
unpacked = msgspec.msgpack.decode(packed, type=User)
print(unpacked.id)  # 1

When schema validation fails, msgspec raises a msgspec.ValidationError with a clear message. The validation happens during decoding in a single pass over the data — no second traversal, no temporary intermediate objects.

import msgspec

class Item(msgspec.Struct):
    name: str
    price: float
    quantity: int

try:
    bad = msgspec.json.decode(b'{"name": "widget", "price": "not-a-number", "quantity": 5}', type=Item)
except msgspec.ValidationError as e:
    print(e)
    # Expected `float`, got `str` - at `$.price`
When to Choose msgspec

Choose msgspec when throughput is a constraint: high-volume APIs, message queue consumers, or any system that processes large numbers of messages per second. Its tradeoff versus Pydantic is that it offers fewer built-in validator utilities and less descriptive error output. Use Pydantic at the developer-facing boundary where error messages matter; use msgspec in the hot path where speed matters.

Protocol Buffers: Cross-Language Binary Format

When you need serialization that works across Python, Go, Java, C++, Rust, and other languages with guaranteed schema compatibility over time, Protocol Buffers (protobuf) is the industry-standard answer. Developed at Google and open-sourced, protobuf uses a schema definition language to describe data structures, then generates code for each target language. The generated Python classes handle encoding and decoding to a compact binary format.

First define your schema in a .proto file:

// user.proto
syntax = "proto3";

package myapp;

message Address {
  string street = 1;
  string city = 2;
  string country = 3;
  string postal_code = 4;
}

message User {
  int32 id = 1;
  string username = 2;
  string email = 3;
  Address address = 4;
  repeated string tags = 5;
  bool active = 6;
}

After generating Python code with protoc, usage looks like this:

# pip install protobuf
# Generate: protoc --python_out=. user.proto
# This produces user_pb2.py

import user_pb2

# Build a message
user = user_pb2.User(
    id=1,
    username="kandi",
    email="[email protected]",
    address=user_pb2.Address(
        street="123 Main St",
        city="Austin",
        country="US",
        postal_code="78701",
    ),
    tags=["instructor", "admin"],
    active=True,
)

# Serialize to bytes
serialized = user.SerializeToString()
print(len(serialized))  # compact binary, much smaller than JSON for large payloads

# Deserialize from bytes
recovered = user_pb2.User()
recovered.ParseFromString(serialized)
print(recovered.username)  # kandi

# Serialize to JSON (for debugging or mixed-format pipelines)
from google.protobuf import json_format
json_str = json_format.MessageToJson(user)
print(json_str)

Protobuf shines in polyglot environments and gRPC-based microservice architectures. Its schema evolution features — adding new optional fields without breaking existing clients — make it well-suited for long-lived systems with multiple language clients.

orjson: Faster JSON, Drop-In Replacement

If you want to stay in the JSON world but need significantly better performance than the standard library offers, orjson is the drop-in replacement. Written in Rust (now at v3.11.7 with Python 3.14 support and 3.15 ABI compatibility), orjson natively handles datetime, UUID, and dataclass instances without a custom encoder (NumPy arrays are supported as well, behind the orjson.OPT_SERIALIZE_NUMPY option), and produces bytes rather than str — which is already the right type for network transmission. Its latest builds include runtime-detected AVX-512 optimizations for string processing on supported hardware.

# pip install orjson
import orjson
from datetime import datetime, timezone
from uuid import UUID
from dataclasses import dataclass


@dataclass
class Event:
    event_id: UUID
    user: str
    timestamp: datetime
    payload: dict


event = Event(
    event_id=UUID("12345678-1234-5678-1234-567812345678"),
    user="kandi",
    timestamp=datetime(2026, 3, 8, 12, 0, 0, tzinfo=timezone.utc),
    payload={"action": "login", "ip": "192.168.1.1"},
)

# orjson handles dataclasses, datetime, and UUID natively
encoded: bytes = orjson.dumps(event, option=orjson.OPT_INDENT_2)
print(encoded.decode())

# Decoding returns standard Python types
decoded = orjson.loads(encoded)
print(decoded["user"])       # kandi
print(decoded["timestamp"])  # "2026-03-08T12:00:00+00:00"

# For maximum throughput, skip indent
fast_bytes: bytes = orjson.dumps(event)
Pro Tip

Pydantic v2 does not need orjson bolted on: model_dump_json() is already implemented by a fast Rust-based encoder in pydantic-core, so layering orjson on top of Pydantic rarely pays off. When even that is too slow, msgspec outperforms the combination because validation and encoding happen in a single compiled pass.

Serialization Security: The Threat Model

Serialization is an attack surface. Every deserialization boundary is a point where your program trusts a stream of bytes to reconstruct objects, and that trust can be exploited. Understanding the threat model is not optional for production code — it is the difference between a robust system and one waiting to be compromised.

Pickle: The Permanent Risk

The pickle security warning deserves more than a callout box. In early 2025, Reversing Labs researchers identified malicious pickle files uploaded to Hugging Face, one of the largest ML model repositories. In the same year, Sonatype disclosed four CVEs (including CVE-2025-1716) against picklescan, the security tool that platforms like Hugging Face rely on to detect malicious pickles — demonstrating that even the defenses against pickle attacks can be bypassed.

The attack mechanism is fundamental to how pickle works: the __reduce__ method in any Python class can specify arbitrary functions to call during deserialization. An attacker crafts a pickle payload that calls os.system(), subprocess.run(), or even pip.main() to install a malicious package silently. There is no way to make pickle "safe" for untrusted data without fundamentally restricting what it can deserialize, which defeats its purpose.

# DANGER: This is what a malicious pickle looks like.
# DO NOT run this code. It is here for educational purposes only.
import pickle
import os

class MaliciousPayload:
    def __reduce__(self):
        # This executes during deserialization
        return os.system, ("echo 'You have been compromised'",)

# An attacker would distribute this as a .pkl file
# payload = pickle.dumps(MaliciousPayload())
# Loading it runs the system command:
# pickle.loads(payload)  # EXECUTES: os.system("echo 'You have been compromised'")
Defense in Depth

For ML model serialization, prefer SafeTensors (the Hugging Face format designed to eliminate arbitrary code execution), ONNX for inference models, or torch.load() with weights_only=True (made the default in PyTorch 2.6). If you must use pickle for internal caching, sign your pickle files with HMAC and verify the signature before loading. Never deserialize pickles from the network, user uploads, or third-party model repositories without sandboxing.
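The HMAC-signing idea can be sketched with the standard library alone (SECRET_KEY is a placeholder here; real systems would load it from a secrets manager, never hard-code it):

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder for illustration


def signed_dumps(obj) -> bytes:
    """Pickle an object and prepend a 32-byte HMAC-SHA256 signature."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload


def signed_loads(blob: bytes):
    """Verify the signature before unpickling; refuse tampered data."""
    signature, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("pickle signature mismatch - refusing to load")
    return pickle.loads(payload)


blob = signed_dumps({"model": "fraud_detector", "version": "1.4.2"})
print(signed_loads(blob)["version"])  # 1.4.2

# Any byte-level tampering invalidates the signature
tampered = blob[:32] + blob[32:].replace(b"1.4.2", b"6.6.6")
try:
    signed_loads(tampered)
except ValueError as exc:
    print(exc)  # pickle signature mismatch - refusing to load
```

Note the order of operations: the signature is checked before pickle.loads() ever runs, so a forged payload never reaches the dangerous deserialization step.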

JSON Injection and Type Confusion

JSON is safer than pickle because it cannot execute code during parsing, but it is not immune to security issues. JSON deserialization without validation can lead to type confusion attacks: a field your code expects to be a string arrives as a nested object, causing unexpected behavior in downstream logic. This is precisely why boundary validation with Pydantic or msgspec matters — it is not just about type correctness, it is about preventing an attacker from controlling the shape of your internal data structures.

Another subtle risk: extremely large or deeply nested JSON payloads can cause denial-of-service through memory exhaustion or stack overflow during parsing. The standard library's json module has no built-in depth or size limits. Both orjson (which implements a 1,024-level recursion limit) and msgspec (which validates against a schema with bounded nesting) offer protection here by default.
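Two cheap stdlib-only defenses: bound the payload size before parsing, and rely on the fact that deep nesting fails loudly rather than silently — CPython's parser raises RecursionError once the recursion limit is hit. A sketch:

```python
import json


def bounded_loads(raw: str, max_bytes: int = 1_000_000):
    """Reject oversized payloads before spending memory on parsing them."""
    if len(raw) > max_bytes:
        raise ValueError("payload exceeds size limit")
    return json.loads(raw)


print(bounded_loads('{"ok": true}'))  # {'ok': True}

# Deeply nested input exhausts the recursion limit instead of parsing
deep = "[" * 50_000 + "]" * 50_000
try:
    json.loads(deep)
except RecursionError:
    print("nested too deeply for the stdlib parser")
```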

Streaming and Incremental Serialization

The serialization patterns discussed so far assume you have the complete object in memory before encoding it. In production systems processing large datasets, log streams, or real-time event feeds, that assumption breaks down. Streaming serialization lets you encode and transmit data incrementally, without buffering the entire payload.

Newline-Delimited JSON (NDJSON)

NDJSON is a simple convention: one valid JSON object per line, separated by newline characters. It is widely used in logging pipelines, data processing, and API streaming responses. Unlike JSON arrays, NDJSON can be parsed incrementally — each line is a standalone JSON document.

import json
from pathlib import Path

events = [
    {"event": "login", "user": "kandi", "ts": "2026-03-01T09:00:00Z"},
    {"event": "page_view", "user": "kandi", "ts": "2026-03-01T09:01:15Z"},
    {"event": "logout", "user": "kandi", "ts": "2026-03-01T09:45:00Z"},
]

# Write NDJSON
output = Path("events.ndjson")
with output.open("w", encoding="utf-8") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Read NDJSON incrementally (constant memory)
with output.open("r", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)
        print(f"{event['event']} by {event['user']}")

# msgspec has built-in NDJSON support (since 0.19.x)
import msgspec

encoder = msgspec.json.Encoder()
# encode_lines writes multiple objects as NDJSON
ndjson_bytes = encoder.encode_lines(events)

decoder = msgspec.json.Decoder()
decoded_events = decoder.decode_lines(ndjson_bytes)
print(len(decoded_events))  # 3

When to Avoid Serializing Altogether

The most efficient serialization is no serialization. Before reaching for any library, ask whether the data needs to cross a boundary at all. If your worker processes share the same machine, consider shared memory (multiprocessing.shared_memory, available since Python 3.8) for large arrays, which avoids the copy-and-encode overhead entirely. For inter-thread communication within a single process, pass Python objects directly through queue.Queue — no serialization needed. The mental model that helps here is thinking of serialization as a tax you pay every time data crosses an architectural boundary. Reduce boundaries, reduce tax.
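The shared-memory route can be sketched with the stdlib alone. This example attaches to the block in the same process to stay runnable; a real setup would pass block.name to a worker process, which attaches by name and reads the same bytes with zero copies:

```python
from multiprocessing import shared_memory

# Create a named block of shared memory and write raw bytes into it --
# no encoding step, the bytes live in one OS-managed segment
block = shared_memory.SharedMemory(create=True, size=16)
block.buf[:5] = b"hello"

# A second process would attach by name; attaching in-process here
# shows that no copy or serialization happens
view = shared_memory.SharedMemory(name=block.name)
received = bytes(view.buf[:5])
print(received)  # b'hello'

# Cleanup: close both handles, then unlink the segment itself
view.close()
block.close()
block.unlink()
```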

Choosing the Right Tool

The decision comes down to four questions: Do you need cross-language compatibility? Do you need runtime validation of untrusted data? Do you need maximum throughput? Do you want to stay within the standard library?

For each tool: format, validation, cross-language reach, relative speed, and typical best fit.

json — JSON text; no validation; cross-language: yes; speed: baseline. General use, stdlib only.
pickle — binary; no validation; Python only; speed: fast. Caching, IPC, ML models (trusted sources only).
dataclasses + json — JSON text; no validation; cross-language: yes; speed: fast. Structured data, no external deps.
attrs + cattrs — JSON / MessagePack / YAML / TOML; composable validation; cross-language: yes; speed: fast. Decoupled models, multi-format systems.
Pydantic v2 — JSON / dict; rich validation; cross-language: yes; speed: moderate. API boundaries, FastAPI, config parsing.
msgspec — JSON / MessagePack / TOML / YAML; typed validation; cross-language: yes (MessagePack support varies); speed: fastest. High-throughput services, message consumers.
orjson — JSON bytes; no validation; cross-language: yes; speed: very fast. Fast JSON drop-in, dataclass/datetime support.
Protocol Buffers — binary; schema validation; cross-language: yes; speed: very fast. Polyglot systems, gRPC, schema evolution.

A practical heuristic for 2026: start with dataclasses and the standard json module. Add Pydantic when you need to validate external input. Switch to msgspec if profiling reveals serialization is a bottleneck. Use Protocol Buffers only when you genuinely need cross-language interoperability or schema evolution in a distributed system. Consider attrs + cattrs when you want clean separation between your data model and its serialization rules, particularly in larger codebases where the same model needs to serialize differently in different contexts.

Key Takeaways

  1. Serialization is a boundary problem. The most important place to serialize carefully — with validation and type enforcement — is wherever untrusted data enters your system. Internal object movement between your own functions rarely needs this overhead.
  2. JSON is the default, but not always the best choice. It is universally readable and interoperable, but it is text-based and lacks native types for dates, UUIDs, and binary data. For internal services under your control, MessagePack or Protocol Buffers will be faster and more compact.
  3. Pickle is a code execution vector, not just a convenience risk. Real-world attacks in 2025 demonstrated that malicious pickle files on platforms like Hugging Face can compromise entire systems. Reserve pickle for data you generated yourself, sign it with HMAC, and prefer SafeTensors or ONNX for model serialization.
  4. msgspec is the performance winner in 2026. If throughput matters and you are building a Python-native system, msgspec 0.20.0 offers the best combination of speed, type safety, and multi-format support available in the ecosystem today, now with Python 3.14 and freethreaded support.
  5. Pydantic and msgspec solve different problems. Pydantic excels at developer ergonomics, rich error messages, and its integration with the Python API ecosystem (especially FastAPI). msgspec excels at raw throughput with a narrower feature set. Choose based on what the code actually needs at that point in your system, not as a blanket policy across the entire codebase.
  6. Separation of concerns matters at scale. The attrs + cattrs pattern — where serialization rules are external to the data model — scales better in large codebases where the same model may need to serialize differently depending on context. Consider it when your project outgrows a single serialization strategy.
  7. The most efficient serialization is no serialization. Before adding a library, ask whether the data actually needs to cross a boundary. Shared memory, direct object passing, and architectural decisions that reduce the number of serialization points can eliminate more overhead than any library optimization.
  8. Always use context managers for file I/O. Whether you are writing pickle files, JSON files, or any other format, the with statement ensures files are closed properly even when exceptions occur. This is basic Pythonic hygiene that prevents resource leaks in production systems.

Serialization in Python has never had more good options. The right choice depends on your constraints: performance requirements, language interoperability, the degree of trust you place in incoming data, and how much complexity you want to introduce into your dependency tree. The tools covered here span the full spectrum, from the zero-dependency standard library to the fastest compiled libraries available in the ecosystem. Think of your serialization strategy the way you think about security: in layers, with the strongest validation at the outermost boundary and progressively less overhead as data moves inward through trusted territory.
