Python's pickle module is one of the most powerful -- and most misunderstood -- tools in the standard library. It lets you serialize virtually any Python object into a byte stream and reconstruct it later, preserving structure, types, and relationships. That flexibility is precisely what makes it dangerous.
This article digs into how pickle actually works under the hood, walks through the protocol versions that have evolved over three decades, examines the real-world security incidents that have made pickle a four-letter word in some circles, and shows you how to use it responsibly when the situation calls for it -- including how to audit your own codebases for hidden pickle exposure. No hand-waving. Real code, real context, real comprehension.
What Pickle Actually Does
Serialization is the process of converting a Python object -- a dictionary, a class instance, a nested data structure -- into a byte stream that can be stored in a file, saved to a database, or transmitted across a network. Deserialization reverses that process, reconstructing the original object from the byte stream. The official Python documentation describes pickling as the conversion of a Python object hierarchy into a byte stream, and unpickling as the inverse operation.
At its core, pickle operates as a tiny stack-based virtual machine. When you pickle an object, the module generates a sequence of opcodes -- low-level instructions that tell the unpickler how to rebuild the object. When you unpickle, those opcodes execute one by one, reconstructing the object from scratch. This is the crucial detail that separates pickle from data-only formats like JSON: pickle streams contain executable instructions, not just data. That distinction is the root of both pickle's power and its danger.
Here is a basic example:
import pickle
# A simple dictionary with mixed types
data = {
    "name": "Alice",
    "scores": [98, 87, 92],
    "metadata": {"enrolled": True, "year": 2024},
}
# Serialize to bytes
pickled = pickle.dumps(data)
print(type(pickled)) # <class 'bytes'>
print(len(pickled)) # varies by protocol
# Deserialize back to a Python object
restored = pickle.loads(pickled)
assert restored == data
print(restored)
That looks simple enough. But pickle is not limited to basic types. It handles class instances, nested objects, circular references, and custom reconstruction logic. It can even serialize objects that reference each other in cycles -- something JSON cannot do at all. That capability is where both the power and the danger live.
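To make the circular-reference point concrete, here is a short sketch: a list that contains itself round-trips through pickle with its identity intact, while json refuses the same structure outright.

```python
import json
import pickle

# A list that contains itself -- a circular reference
cycle = [1, 2]
cycle.append(cycle)

# pickle preserves the cycle, including object identity
restored = pickle.loads(pickle.dumps(cycle))
assert restored[2] is restored  # the self-reference survives the round trip

# json cannot represent it at all
try:
    json.dumps(cycle)
except ValueError as e:
    print(f"JSON error: {e}")  # Circular reference detected
```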
The pickletools module in the standard library lets you disassemble pickle opcodes to inspect what a pickle stream actually contains. We cover this in detail later in this article.
A Brief History: Pickle Protocol Versions and Their PEPs
The pickle protocol was originally designed in 1995, when floppy disks and dial-up links were the bottleneck and concerns like in-memory copying during serialization were irrelevant. Since then, the protocol has been overhauled multiple times, each revision documented in a Python Enhancement Proposal (PEP). Understanding these protocol versions is not just trivia -- it directly affects the size of your serialized data, compatibility across Python versions, and whether you can take advantage of zero-copy optimizations.
Protocol 0 and 1 (Original)
Protocol 0 is the original ASCII-based format. Protocol 1 introduced a binary format that was more space-efficient. Both date back to early Python and remain readable by every version of Python ever released. You will rarely have a reason to use these today, but they still exist for backward compatibility.
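You can see the difference directly: protocol 0 output is pure ASCII, while protocol 1 switches to binary opcodes and usually comes out smaller. A quick sketch:

```python
import pickle

data = {"a": 1}

p0 = pickle.dumps(data, protocol=0)  # ASCII opcodes, newline-terminated
p1 = pickle.dumps(data, protocol=1)  # compact binary opcodes

print(p0)                          # readable if you squint
print(all(b < 128 for b in p0))    # True -- every byte is ASCII
print(len(p0), len(p1))            # protocol 1 is the smaller of the two
```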
Protocol 2 -- PEP 307
PEP 307, authored by Guido van Rossum and Tim Peters and finalized in January 2003, introduced protocol 2 alongside Python 2.3. The motivation was concrete: pickling new-style classes in Python 2.2 was clumsy and produced bloated output. A trivial new-style class could produce a pickled string three times longer than the equivalent classic class.
Protocol 2 introduced several important changes. It added the __reduce_ex__ method, giving objects finer control over how they decompose for pickling based on the protocol version in use. It introduced __getnewargs__ for new-style classes, and added the __newobj__ unpickling function that enabled more efficient reconstruction. The PEP also formalized the extension registry, a mechanism for assigning integer codes to frequently pickled classes to reduce serialized size.
Notably, PEP 307 also marked a turning point in how the Python community talked about pickle security. The Python 2.3 release notes acknowledged that unpickling should not be considered a safe operation. Python 2.2 had included hooks for a __safe_for_unpickling__ attribute, but that code was never audited and was removed in 2.3.
Protocol 3 (Python 3.0)
Protocol 3 arrived with Python 3.0 and added explicit support for bytes objects. It cannot be unpickled by Python 2.x, making it the first protocol to formally break backward compatibility with the Python 2 line.
Protocol 4 -- PEP 3154
PEP 3154, authored by Antoine Pitrou, introduced protocol 4 in Python 3.4. This version tackled a practical limitation: protocol 3 could not handle objects larger than 4 GiB because it used 32-bit length prefixes. Protocol 4 switched to 64-bit framing, supporting very large objects. It also added the ability to pickle more kinds of objects (including nested classes and qualnames) and introduced data format optimizations like shorter opcodes for small objects. Protocol 4 became the default in Python 3.8 and remained the default through Python 3.13.
Protocol 5 -- PEP 574
PEP 574, authored by Antoine Pitrou in 2018, introduced protocol 5 alongside Python 3.8. This was a significant evolution driven by the scientific computing and data science communities, where pickle had become a de facto wire protocol for transferring large arrays between processes.
The core problem PEP 574 solved was unnecessary memory copying. When you pickled a large NumPy array under protocol 4, the entire array had to be copied into the pickle byte stream -- potentially doubling memory usage for multi-gigabyte datasets. Protocol 5 introduced out-of-band data buffers, allowing large data to be handled as a separate stream of zero-copy buffers while the application manages those buffers optimally.
The PEP introduced three key additions: a PickleBuffer type for __reduce_ex__ implementations, a buffer_callback parameter during pickling, and a buffers parameter during unpickling. As of Python 3.14, released on October 7, 2025, protocol 5 is now the default protocol (Python 3.14 What's New).
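A minimal sketch of these hooks in action, using pickle.PickleBuffer directly (libraries like NumPy wire this up for you in their own __reduce_ex__ implementations):

```python
import pickle

big = bytearray(b"x" * 1_000_000)  # stand-in for a large array
buffers = []

# buffer_callback receives each out-of-band buffer instead of
# copying its contents into the pickle stream
payload = pickle.dumps(
    pickle.PickleBuffer(big),
    protocol=5,
    buffer_callback=buffers.append,
)
print(len(payload))   # tiny -- the stream holds a placeholder, not the data
print(len(buffers))   # 1 -- the megabyte travels out-of-band

# The receiver supplies the same buffers to reattach the data
restored = pickle.loads(payload, buffers=buffers)
assert bytes(restored) == bytes(big)
```

In a real pipeline the out-of-band buffers would be sent over a separate channel (shared memory, scatter/gather I/O) rather than kept in a Python list.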
Here is how you check which protocol your Python installation uses by default:
import pickle
print(f"Default protocol: {pickle.DEFAULT_PROTOCOL}")
print(f"Highest available: {pickle.HIGHEST_PROTOCOL}")
print(f"All protocols: {list(range(pickle.HIGHEST_PROTOCOL + 1))}")
And here is how to specify a particular protocol version:
import pickle
data = {"key": "value"}
# Use the highest available protocol
high = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
# Use protocol 2 for backward compatibility
legacy = pickle.dumps(data, protocol=2)
print(f"Protocol {pickle.HIGHEST_PROTOCOL} size: {len(high)} bytes")
print(f"Protocol 2 size: {len(legacy)} bytes")
If you are using multiprocessing or frameworks like Dask that rely on pickle for inter-process communication, the upgrade to protocol 5 as the default in Python 3.14 means you get zero-copy buffer support without changing any code -- as long as the objects you are pickling return PickleBuffer views from their __reduce_ex__ implementations. NumPy arrays support this natively.
How Pickle Handles Custom Objects
Pickle's ability to serialize custom class instances is what separates it from simpler formats like JSON. When pickle encounters an object, it needs to figure out how to decompose it into something it can store and later reconstruct. This is where the dunder methods come in.
The __reduce__ and __reduce_ex__ Methods
When an object is pickled, the pickle module calls its __reduce__ (or __reduce_ex__) method. This method returns a tuple describing how to reconstruct the object during unpickling. The tuple contains a callable, arguments for that callable, and optionally additional state information.
import pickle
class GameState:
    def __init__(self, player, level, score):
        self.player = player
        self.level = level
        self.score = score

    def __reduce__(self):
        # Return (callable, args) -- pickle calls callable(*args) to reconstruct
        return (GameState, (self.player, self.level, self.score))

    def __repr__(self):
        return f"GameState(player={self.player!r}, level={self.level}, score={self.score})"
state = GameState("Alice", 5, 12000)
pickled = pickle.dumps(state)
restored = pickle.loads(pickled)
print(restored) # GameState(player='Alice', level=5, score=12000)
Think about what just happened. The __reduce__ method told pickle: "To reconstruct this object, call GameState with these three arguments." The unpickler obediently does so. This is the same mechanism that makes pickle dangerous -- the only difference between a safe and a malicious pickle is what callable is returned.
The __getstate__ and __setstate__ Methods
For more nuanced control, you can use __getstate__ to specify exactly what gets pickled and __setstate__ to control how it gets restored:
import pickle
class DatabaseConnection:
    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.connection = self._connect()  # Not serializable

    def _connect(self):
        # Simulate a live connection object
        return f"Connection({self.host}:{self.port})"

    def __getstate__(self):
        # Exclude the live connection from the pickled state
        state = self.__dict__.copy()
        del state["connection"]
        return state

    def __setstate__(self, state):
        # Restore state and re-establish the connection
        self.__dict__.update(state)
        self.connection = self._connect()

    def __repr__(self):
        return f"DB({self.host}:{self.port}, conn={self.connection})"
db = DatabaseConnection("localhost", 5432)
print(f"Before: {db}")
pickled = pickle.dumps(db)
restored = pickle.loads(pickled)
print(f"After: {restored}")
This __getstate__/__setstate__ pattern is essential for objects that hold non-serializable resources like file handles, sockets, or database connections. You strip the transient state during pickling and rebuild it during unpickling. It is also the right approach for objects that cache computed values you would rather recompute than store.
The Security Problem: Why Pickle Is Dangerous
The Python documentation includes a prominent warning that pickle is not secure and that users should only unpickle data they trust. It states plainly that malicious pickle data can execute arbitrary code during unpickling (Python docs, pickle module). The danger is baked into pickle's design. Because __reduce__ can return any callable, an attacker can craft a pickle stream that, when deserialized, executes arbitrary system commands:
import pickle
import os
class MaliciousPayload:
    def __reduce__(self):
        # When unpickled, this calls os.system with the given argument
        return (os.system, ("echo 'You have been compromised'",))
# This is what a malicious pickle looks like
malicious_bytes = pickle.dumps(MaliciousPayload())
# DO NOT run pickle.loads() on untrusted data
# pickle.loads(malicious_bytes) # Would execute the system command
The __reduce__ method tells pickle: "To reconstruct this object, call os.system with this argument." The unpickler obediently does so. Replace that echo command with something that opens a reverse shell, exfiltrates data, or installs malware, and you have a full remote code execution vulnerability. The attack works because pickle's design makes no distinction between "reconstruct this dictionary" and "run this system command" -- they are both just callables with arguments.
Real-World Incidents
This is not a hypothetical attack vector. In February 2024, JFrog researchers reported finding malicious ML models on Hugging Face, the largest open-source AI model hosting platform. These models were stored in PyTorch format, which uses pickle serialization under the hood. When loaded, the embedded payloads executed reverse shells, performed credential theft, or fingerprinted the victim's system (JFrog, February 2024).
In February 2025, ReversingLabs researcher Karlo Zanki discovered two additional malicious models on Hugging Face that used a novel evasion technique dubbed "nullifAI." The malicious Python content appeared at the beginning of the pickle file, and the object serialization broke shortly after the payload executed -- causing Hugging Face's scanning tool, Picklescan, to throw an error without flagging the malicious content. Hugging Face removed the models after they were reported (ReversingLabs, February 2025).
By April 2025, Protect AI reported having scanned 4.47 million unique model versions on Hugging Face, identifying 352,000 unsafe or suspicious issues across 51,700 models (Protect AI/Hugging Face, April 2025). The scanning tools themselves proved vulnerable: Sonatype disclosed four CVEs in Picklescan (CVE-2025-1716, CVE-2025-1889, CVE-2025-1944, CVE-2025-1945), and JFrog Security Research independently found three additional zero-day vulnerabilities that allowed attackers to completely bypass the tool's detection. Those were reported in June 2025 and fixed in Picklescan version 0.0.31 by September 2025 (JFrog, December 2025).
In January 2026, Palo Alto Networks' Unit 42 revealed that the risk extends beyond pickle files themselves. Researchers found remote code execution vulnerabilities in supporting Python libraries used by ML models on Hugging Face -- including NVIDIA's NeMo framework (CVE-2025-23304) and Salesforce libraries (CVE-2026-22584). Even models stored in the "safer" Safetensors format were vulnerable when their supporting libraries loaded executable metadata from configuration files (Unit 42, January 2026).
The Industry Response
The web framework Django took decisive action on this issue. Django's PickleSerializer, which was used for session serialization, was formally deprecated in Django 4.1 (August 2022) and fully removed in Django 5.0, released in December 2023 (Django Deprecation Timeline).
PyTorch changed the default value of the weights_only argument in torch.load() from False to True in version 2.6 (January 2025), after warning users since version 2.4. This prevents deserialization of arbitrary Python objects when loading model files (PyTorch Developer Forum, November 2024). NumPy's numpy.load() defaults to allow_pickle=False for the same reason.
Never call pickle.loads() on data received from an untrusted source. This includes data from web requests, public APIs, user uploads, or any network source you do not fully control. No amount of input validation prevents a crafted pickle stream from executing code. The Python documentation recommends using HMAC to verify data integrity when needed.
Defensive Techniques: Using Pickle More Safely
If you must use pickle within trusted environments, there are concrete steps you can take to reduce your risk. None of these make pickle "safe" against arbitrary input, but they narrow the attack surface considerably when combined.
Restricted Unpickler
You can subclass pickle.Unpickler and override find_class to create an allowlist of permitted types:
import pickle
import io
class RestrictedUnpickler(pickle.Unpickler):
    SAFE_CLASSES = {
        ("builtins", "list"),
        ("builtins", "dict"),
        ("builtins", "set"),
        ("builtins", "str"),
        ("builtins", "int"),
        ("builtins", "float"),
        ("builtins", "bool"),
        ("builtins", "tuple"),
        ("builtins", "bytes"),
    }

    def find_class(self, module, name):
        if (module, name) not in self.SAFE_CLASSES:
            raise pickle.UnpicklingError(
                f"Blocked attempt to unpickle: {module}.{name}"
            )
        return super().find_class(module, name)

def safe_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()
# Test with safe data
safe_data = pickle.dumps({"key": [1, 2, 3]})
print(safe_loads(safe_data)) # {'key': [1, 2, 3]}
# Test with dangerous data
import os
class Evil:
    def __reduce__(self):
        return (os.system, ("echo pwned",))

evil_data = pickle.dumps(Evil())
try:
    safe_loads(evil_data)
except pickle.UnpicklingError as e:
    print(f"Blocked: {e}")  # Blocked: Blocked attempt to unpickle: posix.system
This is a legitimate hardening technique, but it is not bulletproof. Pickle and its variants cannot be made fully safe against arbitrary input because they allow execution by design. The allowlist approach is a defense-in-depth layer, not a complete solution.
HMAC Signing
If you control both the pickling and unpickling sides, sign your pickle data with HMAC to detect tampering:
import pickle
import hmac
import hashlib
SECRET_KEY = b"your-secret-key-here"
def sign_pickle(obj):
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    signature = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    return signature + data

def verify_and_load(signed_data):
    signature = signed_data[:32]  # SHA-256 produces 32 bytes
    data = signed_data[32:]
    expected = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("Pickle data has been tampered with")
    return pickle.loads(data)
# Usage
original = {"scores": [100, 95, 87], "user": "Alice"}
signed = sign_pickle(original)
restored = verify_and_load(signed)
print(restored)
# Tamper detection
tampered = signed[:32] + b"corrupted" + signed[40:]
try:
    verify_and_load(tampered)
except ValueError as e:
    print(f"Caught: {e}")
This ensures that only data you pickled yourself can be unpickled. But it does not protect against the case where the signing key is compromised or where an insider with access to the key crafts a malicious payload.
Combining Both: Restrict + Verify
The strongest approach for trusted environments is to combine HMAC verification with a restricted unpickler. Verify the signature first to confirm data integrity, then use the restricted unpickler as a second layer of defense:
import pickle
import hmac
import hashlib
import io
SECRET_KEY = b"your-secret-key-here"
class RestrictedUnpickler(pickle.Unpickler):
    SAFE_CLASSES = {
        ("builtins", "list"),
        ("builtins", "dict"),
        ("builtins", "set"),
        ("builtins", "str"),
        ("builtins", "int"),
        ("builtins", "float"),
        ("builtins", "bool"),
        ("builtins", "tuple"),
    }

    def find_class(self, module, name):
        if (module, name) not in self.SAFE_CLASSES:
            raise pickle.UnpicklingError(
                f"Blocked: {module}.{name}"
            )
        return super().find_class(module, name)

def secure_loads(signed_data):
    # Layer 1: Verify HMAC signature
    signature = signed_data[:32]
    data = signed_data[32:]
    expected = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("Signature mismatch -- data tampered")
    # Layer 2: Restricted unpickler
    return RestrictedUnpickler(io.BytesIO(data)).load()
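The producing side is the mirror image: pickle first, then prepend the signature. A self-contained sketch of that half (the key value here is a placeholder, not a recommendation):

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"your-secret-key-here"

def secure_dumps(obj):
    # Serialize, then prepend the 32-byte HMAC-SHA256 signature
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return hmac.new(SECRET_KEY, data, hashlib.sha256).digest() + data

signed = secure_dumps({"scores": [100, 95, 87]})

# Receiving side: verify before unpickling
sig, data = signed[:32], signed[32:]
expected = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
assert hmac.compare_digest(sig, expected)
print(pickle.loads(data))  # {'scores': [100, 95, 87]}
```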
How to Use Python Pickle Safely
The following steps walk through the minimal safe pattern for using pickle in a controlled, trusted environment. Each step maps to a concrete action you can take in your own code.
1. Confirm the data source is fully trusted. Before calling pickle.loads() or pickle.load() on any data, verify that you generated it yourself or received it through an authenticated, integrity-verified channel. If there is any doubt about the origin, stop here and use JSON or another data-only format instead.
2. Sign your pickle data with HMAC before storing or transmitting it. Use hmac.new(SECRET_KEY, data, hashlib.sha256).digest() to produce a 32-byte signature and prepend it to the serialized bytes. Store the secret key securely and never include it in the serialized data itself.
3. Verify the HMAC signature before unpickling. On the receiving end, extract the first 32 bytes as the signature and the remainder as the pickle payload. Use hmac.compare_digest() -- not == -- to compare the stored and computed signatures. Raise an error and abort if they do not match.
4. Subclass pickle.Unpickler and override find_class. Define a set of allowed (module, name) tuples -- only the types your application legitimately pickles. Raise pickle.UnpicklingError for any type not on the allowlist. This is your second line of defense if the HMAC check is somehow bypassed.
5. Use pickle.HIGHEST_PROTOCOL when serializing. This ensures you get the most efficient encoding available in your Python version. On Python 3.14+, this is protocol 5 with zero-copy buffer support. Avoid hardcoding protocol numbers unless you have a specific backward-compatibility requirement.
6. Open pickle files in binary mode. Always use "wb" when writing and "rb" when reading. Opening in text mode corrupts the byte stream silently on some platforms.
7. Scan unfamiliar pickle files with fickling or pickletools before loading. Run fickling --check-safety yourfile.pkl or use pickletools.dis() to inspect the opcode stream. Look for GLOBAL or REDUCE opcodes referencing os, subprocess, or builtins. If you see them, do not load the file.
8. Add Bandit to your CI/CD pipeline. Run bandit -r your_project/ -t B301,B403 as part of every build. This catches new pickle.loads() calls before they reach production and flags indirect pickle usage through shelve and related modules.
When Pickle Is the Right Choice (and When It Is Not)
Pickle is appropriate in a narrow set of circumstances. It works well for inter-process communication within a single trusted system, for caching computed results locally, for saving and restoring application state where the data never crosses a trust boundary, and for distributed computing frameworks like Dask and multiprocessing that operate within controlled infrastructure. Protocol 5's out-of-band buffers make it particularly efficient for scientific computing pipelines that move large arrays between worker processes. Python 3.14's new concurrent.interpreters module also uses pickle-based transfers for data exchange between subinterpreters (Python 3.14 What's New).
Pickle is not appropriate for data from untrusted sources (period), for web APIs or client-server communication, for long-term storage where Python version compatibility matters, for any context where data crosses a trust boundary, or for ML model distribution to the general public.
Alternatives by Use Case
For data interchange with other languages or systems, JSON is the standard choice. It cannot execute code, it is human-readable, and it is supported everywhere. The tradeoff is that it only handles basic types natively:
import json
data = {"name": "Alice", "scores": [98, 87, 92]}
serialized = json.dumps(data)
restored = json.loads(serialized)
For configuration files, YAML with safe_load (never yaml.load) or TOML are solid options. Python 3.11 added tomllib to the standard library, making TOML a dependency-free choice for configuration.
For ML model storage, the Safetensors format (developed by Hugging Face) is designed specifically to avoid arbitrary code execution. It only stores tensor data and metadata in a simple, safe format. However, as the Unit 42 research from January 2026 demonstrated, Safetensors alone does not guarantee safety if the supporting libraries that load model metadata or configuration files have their own code execution vectors (Unit 42, January 2026). The lesson: safer serialization formats reduce risk, but the entire loading pipeline matters.
For high-performance binary serialization, MessagePack and Protocol Buffers both offer compact binary formats without the code execution risk. For tabular data specifically, Apache Parquet and Apache Arrow offer columnar storage with excellent compression and zero-copy reads.
Inspecting Pickle Streams with pickletools
Python ships with a pickletools module that lets you disassemble pickle byte streams. This is invaluable for understanding what a pickle file actually contains -- and for auditing pickle data before loading it:
import pickle
import pickletools
data = {"language": "Python", "versions": [3.12, 3.13, 3.14]}
pickled = pickle.dumps(data, protocol=4)
# Disassemble the pickle stream
pickletools.dis(pickled)
This prints the individual opcodes, showing you exactly what instructions the unpickler will execute. If you see opcodes like GLOBAL or REDUCE referencing modules like os, subprocess, or builtins.exec, that is a red flag that the pickle file may contain executable payloads.
You can also optimize pickle streams to remove redundant PUT opcodes:
optimized = pickletools.optimize(pickled)
print(f"Original: {len(pickled)} bytes")
print(f"Optimized: {len(optimized)} bytes")
For automated scanning, pickletools.genops() yields each opcode as a tuple you can programmatically inspect. Here is a basic scanner that flags suspicious operations:
import pickle
import pickletools
def scan_pickle(data):
    """Scan a pickle byte stream for suspicious opcodes."""
    suspicious = []
    dangerous_modules = {"os", "subprocess", "builtins", "nt", "posix", "sys"}
    recent_strings = []  # string arguments seen so far; STACK_GLOBAL reads its operands from the stack
    for opcode, arg, pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            # pickletools reports the argument as "module name", space-separated
            module = arg.split(" ")[0]
            if module in dangerous_modules:
                suspicious.append((pos, opcode.name, arg))
        elif opcode.name == "STACK_GLOBAL":
            # STACK_GLOBAL takes module and name from the stack; the two most
            # recent string opcodes are normally those operands
            module = recent_strings[-2] if len(recent_strings) >= 2 else "?"
            name = recent_strings[-1] if recent_strings else "?"
            if module in dangerous_modules:
                suspicious.append((pos, opcode.name, f"{module}.{name}"))
        if isinstance(arg, str):
            recent_strings.append(arg)
    return suspicious
# Test with safe data
safe = pickle.dumps({"hello": "world"})
print(f"Safe data findings: {scan_pickle(safe)}") # []
# Test with dangerous data
import os
class Evil:
    def __reduce__(self):
        return (os.system, ("echo pwned",))

evil = pickle.dumps(Evil())
print(f"Evil data findings: {scan_pickle(evil)}")  # Non-empty -- the os.system reference is flagged
Auditing Pickle Files with fickling
While pickletools gives you raw opcode-level visibility, the open-source tool fickling (from Trail of Bits) takes analysis further by decompiling pickle files into equivalent Python source code. This makes it far easier to understand what a pickle file will actually do when loaded.
Install it with pip install fickling, and you can use it from the command line:
# Trace the pickle virtual machine execution safely
fickling --trace suspicious_model.pkl
# Check if a pickle file contains potentially unsafe operations
fickling --check-safety suspicious_model.pkl
You can also use fickling programmatically:
import fickling
# Analyze a pickle file for safety
result = fickling.is_likely_safe("model.pkl")
print(result) # True for simple data structures, False if suspicious
The advantage of fickling over raw pickletools.dis() is readability. Where pickletools shows you STACK_GLOBAL, REDUCE, and TUPLE opcodes, fickling shows you the equivalent Python expression -- making it immediately obvious whether a pickle is calling os.system or just reconstructing a dictionary. For security auditing of ML models or any pickle files from external sources, fickling is an essential tool in the investigation workflow alongside Picklescan.
Pickle and the __slots__ Edge Case
Objects that use __slots__ instead of __dict__ require special attention during pickling, since they don't have an instance dictionary by default. This was one of the issues addressed in PEP 307. Protocol 2 and later handle slots natively, but if you need to support older protocols or want explicit control:
import pickle
class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getstate__(self):
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        self.x = state["x"]
        self.y = state["y"]

    def __repr__(self):
        return f"Point({self.x}, {self.y})"
p = Point(3, 7)
restored = pickle.loads(pickle.dumps(p))
print(restored) # Point(3, 7)
Working with Pickle Files
In practice, you often want to pickle to and from files rather than working with byte strings in memory:
import pickle
records = [
    {"id": 1, "name": "Alice", "score": 98},
    {"id": 2, "name": "Bob", "score": 87},
    {"id": 3, "name": "Charlie", "score": 92},
]

# Write to file -- always use 'wb' (write binary)
with open("records.pkl", "wb") as f:
    pickle.dump(records, f, protocol=pickle.HIGHEST_PROTOCOL)

# Read from file -- always use 'rb' (read binary)
with open("records.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded)
You can also pickle multiple objects to the same file:
import pickle
config = {"learning_rate": 0.001, "epochs": 50}
weights = [0.5, -0.3, 0.8, 0.1]
history = {"loss": [0.9, 0.5, 0.3], "accuracy": [0.6, 0.8, 0.9]}
with open("checkpoint.pkl", "wb") as f:
    pickle.dump(config, f)
    pickle.dump(weights, f)
    pickle.dump(history, f)

with open("checkpoint.pkl", "rb") as f:
    loaded_config = pickle.load(f)
    loaded_weights = pickle.load(f)
    loaded_history = pickle.load(f)
print(loaded_config)
print(loaded_weights)
print(loaded_history)
When pickling multiple objects to a single file, the objects must be unpickled in the same order they were pickled. There is no index or table of contents -- you are reading a sequential stream of pickle operations. If the order is complex or unpredictable, consider wrapping everything in a single dictionary and pickling that instead.
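If you do not know how many objects a stream contains, the idiomatic pattern is to keep loading until pickle.load() raises EOFError. A sketch using an in-memory stream:

```python
import io
import pickle

# Write an unknown number of objects into one stream
stream = io.BytesIO()
for obj in ("alpha", "beta", "gamma"):
    pickle.dump(obj, stream)
stream.seek(0)

# Read them back until the stream is exhausted
objects = []
while True:
    try:
        objects.append(pickle.load(stream))
    except EOFError:
        break
print(objects)  # ['alpha', 'beta', 'gamma']
```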
The shelve Module: Pickle in Disguise
Python's shelve module provides a dictionary-like interface for persistent storage, but under the hood it uses pickle for serialization. This means every security concern about pickle applies equally to shelve, even though the API looks innocent:
import shelve
# This uses pickle under the hood
with shelve.open("mydata") as db:
    db["config"] = {"theme": "dark", "language": "en"}
    db["scores"] = [98, 87, 92]

# Reading it back
with shelve.open("mydata") as db:
    print(db["config"])
    print(db["scores"])
If you are auditing a codebase for pickle exposure, shelve is easy to overlook. Any shelve.open() call that loads files from an untrusted source carries the same risk as a direct pickle.loads() call. The Python documentation for shelve carries the same security warning as pickle itself.
The copyreg Module: Registering Pickle Support for External Types
If you work with types you didn't define -- perhaps from a C extension or a third-party library -- the copyreg module lets you register custom pickle/unpickle functions:
import pickle
import copyreg
# Suppose we have a type we can't modify
class ExternalPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"ExternalPoint({self.x}, {self.y})"

# Register a custom reducer
def pickle_external_point(point):
    return ExternalPoint, (point.x, point.y)

copyreg.pickle(ExternalPoint, pickle_external_point)
# Now it pickles cleanly
p = ExternalPoint(10, 20)
restored = pickle.loads(pickle.dumps(p))
print(restored) # ExternalPoint(10, 20)
Auditing Your Codebase for Pickle Exposure
One of the questions that rarely gets asked in articles about pickle is: how do you find all the places in an existing codebase where pickle data could enter? The answer requires looking beyond direct pickle.loads() calls. Here is a systematic approach.
Direct Pickle Usage
Search for all direct imports and calls. A simple grep gets you started:
# Find all files that import or use pickle
grep -rn "import pickle\|from pickle\|pickle\.load\|pickle\.loads" --include="*.py" .
# Also catch cPickle (legacy Python 2 code that may still be running)
grep -rn "import cPickle\|from cPickle" --include="*.py" .
Indirect Pickle Usage
Several standard library modules use pickle internally. Search for these as well:
# shelve uses pickle under the hood
grep -rn "import shelve\|shelve\.open" --include="*.py" .
# multiprocessing uses pickle to transfer objects between processes
grep -rn "from multiprocessing\|import multiprocessing" --include="*.py" .
# xmlrpc can use pickle for certain data types
grep -rn "import xmlrpc\|from xmlrpc" --include="*.py" .
Framework-Level Pickle
Several popular frameworks and libraries use pickle in ways that may not be immediately obvious:
- PyTorch: torch.load() uses pickle by default. Verify that weights_only=True is set (the default since PyTorch 2.6).
- scikit-learn: joblib.load() uses pickle for model persistence.
- Celery: The default serializer was pickle until Celery 4.0 changed it to JSON.
- Flask/Django sessions: Check your session serializer configuration.
- Redis caching: Some Redis clients use pickle to serialize cached values.
For a thorough audit, the open-source tool Bandit (a Python security linter) flags insecure deserialization calls automatically:
```bash
# Install bandit
pip install bandit

# Scan your project, restricting to the pickle-related tests
bandit -r your_project/ -t B301,B403
```
The B301 test flags calls to pickle.loads() and related deserialization functions, while B403 flags pickle imports. Integrating Bandit into your CI/CD pipeline ensures new pickle usage gets flagged before it reaches production.
Summary of Related PEPs
| PEP | Authors | Year | Protocol | Key Changes |
|---|---|---|---|---|
| PEP 307 | van Rossum, Peters | 2003 | 2 | Introduced `__reduce_ex__`, `__getnewargs__`, the extension registry, and formally declared unpickling unsafe. Shipped with Python 2.3. |
| PEP 3154 | Pitrou | 2014 | 4 | 64-bit framing for objects larger than 4 GiB, support for pickling more object types, data format optimizations. Default in Python 3.8 through 3.13. |
| PEP 574 | Pitrou | 2018 | 5 | Out-of-band data buffers for zero-copy pickling of large data, `PickleBuffer` type, `buffer_callback`/`buffers` parameters. Default since Python 3.14 (October 2025). |
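The protocol 5 machinery from PEP 574 can be exercised directly from the standard library. In this sketch, wrapping a large buffer in PickleBuffer and supplying buffer_callback keeps the payload out of the pickle stream itself, which is what enables zero-copy transfers between processes:

```python
import pickle

# A megabyte of writable data to move around
big = bytearray(b"x" * 1_000_000)

# buffer_callback receives each out-of-band buffer instead of
# embedding it in the stream
buffers = []
stream = pickle.dumps(
    pickle.PickleBuffer(big),
    protocol=5,
    buffer_callback=buffers.append,
)

# The stream itself stays tiny; the megabyte travels out-of-band
print(len(stream))

# The consumer supplies the buffers back at load time
restored = pickle.loads(stream, buffers=buffers)
print(bytes(restored) == bytes(big))  # True
```

In real use, the out-of-band buffers would be handed to a transport that can move them without copying (shared memory, scatter/gather I/O), which is precisely the use case PEP 574 targets for NumPy arrays and similar large objects.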
Frequently Asked Questions
Is Python pickle safe to use?
Python pickle is not safe to use with untrusted data. The Python documentation states that malicious pickle data can execute arbitrary code during unpickling. Pickle should only be used with data you generated yourself and control entirely. For any data crossing a trust boundary, use JSON, MessagePack, or Safetensors instead.
What is the default pickle protocol in Python 3.14?
As of Python 3.14 (released October 2025), the default pickle protocol is protocol 5, which supports zero-copy out-of-band data buffers for large objects like NumPy arrays. Prior to Python 3.14, protocol 4 was the default. You can check the default on any installation by printing pickle.DEFAULT_PROTOCOL.
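Checking the defaults on your own interpreter takes two lines:

```python
import pickle

# Version-dependent: 4 on Python 3.8 through 3.13, 5 on 3.14+
print(pickle.DEFAULT_PROTOCOL)

# The highest protocol the interpreter can read and write
print(pickle.HIGHEST_PROTOCOL)
```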
What are the best alternatives to Python pickle?
The best pickle alternatives depend on your use case. For cross-language data interchange, use JSON. For ML model storage, use the Safetensors format. For high-performance binary serialization, use MessagePack or Protocol Buffers. For tabular data, use Apache Parquet or Apache Arrow. For configuration files, use TOML (built into Python 3.11+) or YAML with safe_load.
Can you scan a pickle file for malicious code?
Yes. The open-source tools Picklescan and fickling (from Trail of Bits) can analyze pickle files for suspicious opcodes. Fickling decompiles pickle streams into equivalent Python source code, making malicious payloads easier to spot. Python's built-in pickletools module also lets you inspect opcodes directly. However, scanning is not foolproof — researchers have documented bypass techniques that evade these tools.
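The pickletools inspection mentioned above looks like this in practice. As a crude screen (the specific opcode set below is an illustrative choice, not a complete detector), you can scan the stream for the opcodes that import and call objects, since those are what malicious payloads use to reach functions like os.system:

```python
import pickle
import pickletools

# A benign stream of plain data
stream = pickle.dumps({"name": "Alice", "scores": [98, 87, 92]})

# Opcodes that import callables or invoke them during unpickling;
# their presence in a stream that should hold plain data is a red flag
suspicious = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}

# genops yields (opcode, argument, position) without executing anything
flagged = [
    op.name
    for op, arg, pos in pickletools.genops(stream)
    if op.name in suspicious
]
print(flagged)  # [] -- plain data needs no imports or calls

# Human-readable opcode listing for manual inspection
pickletools.dis(stream)
```

Crucially, both genops and dis only parse the stream; nothing is executed, so this inspection is safe to run on untrusted files. But as the bypass research shows, absence of suspicious opcodes in a scan is necessary, not sufficient.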
The Bottom Line
Pickle is a powerful serialization tool with a 30-year history in the Python ecosystem. It handles the full complexity of Python's object model -- circular references, class hierarchies, custom reconstruction logic -- in a way that no other serialization format can. That same power is what makes it a loaded weapon when pointed at untrusted data.
The Python community has spent years hardening the ecosystem around this reality. Django removed its PickleSerializer. PyTorch flipped weights_only=True to the default. NumPy defaults to allow_pickle=False. Hugging Face deploys both Picklescan and Protect AI's Guardian scanner across millions of models, and attackers keep finding ways around them. Meanwhile, researchers at Trail of Bits, JFrog, Sonatype, ReversingLabs, and Palo Alto Networks continue to discover new bypass techniques and vulnerabilities in the scanning tools themselves. The trajectory is clear: the industry is moving toward safer serialization formats for any context where trust cannot be guaranteed.
Use pickle when you control both ends of the pipeline and the data never crosses a trust boundary. Use JSON, Safetensors, Protocol Buffers, or MessagePack when it does. Audit your codebases for hidden pickle exposure through shelve, torch.load(), joblib.load(), and Redis caching layers. And if you are loading pickle files from the internet in 2026, make sure you understand exactly what those opcodes are going to do before you let them run.