Before PEP 247, Python developers who needed cryptographic hashing had to learn a different interface for every library they touched. PEP 247 put an end to that chaos by defining a single, consistent API that any hash module could adopt — making it trivial to swap implementations without rewriting your code.
Cryptographic hash functions are everywhere in software: verifying file integrity, storing passwords, generating message authentication codes, and signing data. Python's ecosystem has long had multiple libraries for this work — some in the standard library, some third-party. The problem before 2001 was simple and frustrating: each module made its own choices about method names, argument order, and what attributes to expose. Code written for one hash library couldn't easily be pointed at another.
PEP 247 was authored by A.M. Kuchling and finalized in 2001. Its goal was purely informational: lay down a contract that any cryptographic hash module could implement, so Python programmers could write code that worked against the interface rather than any specific library.
What Is PEP 247 and Why Did It Matter?
A Python Enhancement Proposal, or PEP, is a design document that either introduces a new feature, describes a process, or (in the case of informational PEPs like this one) establishes a standard for the community to follow. PEP 247 is informational: it doesn't change the Python language or standard library directly. Instead, it defines what a well-behaved hash module should look like.
The value of this kind of standardization is straightforward. Imagine you're building a tool that computes checksums. If you code against the PEP 247 interface, your tool works with any compliant hash module — whether that's the standard library's hashlib, a third-party library like PyCryptodome, or a future implementation you haven't heard of yet. Swapping algorithms becomes a one-line change rather than a refactor.
PEP 247 was finalized in 2001, years before Python's hashlib module appeared in Python 2.5 (2006). PEP 247 established the design principles that hashlib was later built to follow.
The API Specification, Piece by Piece
The PEP defines three categories of things a compliant module must provide: a constructor function, a module-level variable, and a set of methods on the hashing object itself.
The Constructor: new()
Every hash module must expose a function called new(). For unkeyed hashes like MD5 or SHA, it looks like this:
new([string])
For keyed hashes like HMAC, a key parameter comes first and is required:
new(key, [string])
In both cases, the optional string argument lets you seed the hash object with initial data at construction time. This is exactly equivalent to calling obj.update(string) immediately after creating the object — it's just a convenience shortcut. Additional keyword arguments are permitted as long as they have sensible defaults.
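As a quick illustration with hashlib (which follows this convention), the two forms below produce identical digests:

```python
import hashlib

# Seeding the hash at construction time...
h1 = hashlib.sha256(b"hello world")

# ...is equivalent to constructing empty and calling update()
h2 = hashlib.sha256()
h2.update(b"hello world")

assert h1.digest() == h2.digest()
print(h1.hexdigest())
```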
The Module-Level Variable: digest_size
Each compliant module must expose an integer variable named digest_size. This tells you the size of the hash output in bytes. For MD5, that's 16 bytes (128 bits). For SHA-256, it's 32 bytes (256 bits). For hash algorithms that support variable output sizes, this variable is set to None.
import hashlib
# hashlib hosts many algorithms, so it exposes digest_size on each
# hash object rather than as a single module-level variable
print(hashlib.md5().digest_size)     # 16
print(hashlib.sha256().digest_size)  # 32
digest_size is measured in bytes, not bits. If you need the bit length for display or documentation purposes, multiply by 8. The PEP chose bytes because that's what you'll use in almost every real operation — seeking through files, allocating buffers, and computing string lengths all work in bytes.
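For instance, recovering the bit count from `digest_size` is a one-liner:

```python
import hashlib

size_bytes = hashlib.sha256().digest_size  # 32
size_bits = size_bytes * 8                 # 256, as in the name "SHA-256"
print(size_bytes, size_bits)
```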
Object Methods
Once you've created a hashing object via new(), it must support four methods:
update(string) — Feed more data into the hash. You can call this as many times as you need, in whatever chunk size suits your use case. This is what makes hash objects practical for hashing large files without loading them entirely into memory.
digest() — Return the current hash value as raw bytes (a string of 8-bit data in Python 2 terminology; a bytes object in Python 3). Critically, calling digest() does not change the state of the object. You can keep feeding data in with update() afterward.
hexdigest() — Return the current hash value as a hexadecimal string. Lowercase letters are specified for the digits a through f. Like digest(), this method does not alter the object's state.
copy() — Return an independent copy of the hashing object. Updating the copy will not affect the original. This is useful when you want to branch: hash a common prefix once, then fork into multiple objects to hash different suffixes without redundant work.
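The guarantees above can be checked directly with hashlib; this short sketch exercises all four methods:

```python
import hashlib

h = hashlib.sha256()
h.update(b"abc")
h.update(b"def")                 # multiple update() calls accumulate

snapshot = h.copy()              # independent copy of the current state
d1 = h.digest()
d2 = h.digest()
assert d1 == d2                  # digest() does not consume state

h.update(b"ghi")                 # original keeps accepting data
assert h.digest() != snapshot.digest()  # the copy is unaffected

# Incremental hashing equals one-shot hashing of the concatenation
assert h.digest() == hashlib.sha256(b"abcdefghi").digest()
```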
The Object-Level digest_size Attribute
The hashing object itself must also carry a digest_size attribute. For fixed-size algorithms this matches the module-level variable. For variable-size algorithms, this attribute must reflect the specific output size chosen when the object was created — so it will never be None at the object level, even if the module-level variable is.
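hashlib's BLAKE2 constructors illustrate this: the algorithm supports variable output sizes, and the object's digest_size reflects whatever size was chosen at construction (a sketch; digest_size=20 here is an arbitrary choice):

```python
import hashlib

# BLAKE2b output size is chosen per object, up to 64 bytes
h = hashlib.blake2b(b"some data", digest_size=20)
print(h.digest_size)    # 20, never None at the object level
print(len(h.digest()))  # 20
```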
Unkeyed vs. Keyed Hashes
PEP 247 draws an explicit distinction between two categories of hash function. Understanding this distinction helps you know when to reach for each type.
Unkeyed hashes like MD5, SHA-1, SHA-256, and SHA-3 produce a fixed-size digest from arbitrary input. They're deterministic: the same input always produces the same output. Anyone with the input data can verify the hash. These are useful for integrity checking — verifying a downloaded file, for example — but not for authentication, because there's no secret involved.
Keyed hashes like HMAC combine a secret key with a hash algorithm to produce a message authentication code. The same data hashed with different keys produces different outputs. Without the key, you can't reproduce the MAC, which is what makes it useful for authentication. In the PEP 247 constructor for keyed hashes, the key parameter is required and comes first.
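Python's standard hmac module implements the keyed-hash side of this contract: the key comes first and is required, the message is optional. A small sketch:

```python
import hmac
import hashlib

# Same message, different keys -> different MACs
mac1 = hmac.new(b"key-one", b"the message", hashlib.sha256)
mac2 = hmac.new(b"key-two", b"the message", hashlib.sha256)

print(mac1.hexdigest())
print(mac2.hexdigest())
assert mac1.hexdigest() != mac2.hexdigest()

# Verification: recompute with the shared key and compare in constant time
expected = hmac.new(b"key-one", b"the message", hashlib.sha256).digest()
assert hmac.compare_digest(mac1.digest(), expected)
```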
MD5 and SHA-1 are no longer considered secure for cryptographic purposes. They remain in Python's hashlib for legacy compatibility and non-security uses like checksums, but new code should use SHA-256 or stronger. For password hashing specifically, use a dedicated algorithm like bcrypt, scrypt, or argon2 rather than a general-purpose hash.
Practical Examples
Here's how the PEP 247 interface looks in practice using Python's hashlib, which was designed to honor this contract.
Basic hashing with update() and hexdigest()
import hashlib
# Create a SHA-256 hash object
h = hashlib.sha256()
# Feed data in chunks (great for large files)
h.update(b"Hello, ")
h.update(b"PythonCodeCrack!")
# Get the hex digest without altering the object
print(h.hexdigest())
# (64 hex characters)
# Can still continue updating
h.update(b" More data.")
print(h.hexdigest()) # Different result
Using the string shortcut at construction
import hashlib
# Pass data directly to the constructor
h = hashlib.sha256(b"Hello, PythonCodeCrack!")
print(h.hexdigest())
Branching with copy()
import hashlib
# Hash a common prefix
base = hashlib.sha256(b"shared-prefix:")
# Fork into two independent objects
branch_a = base.copy()
branch_b = base.copy()
branch_a.update(b"path-A")
branch_b.update(b"path-B")
print(branch_a.hexdigest()) # Hash of "shared-prefix:path-A"
print(branch_b.hexdigest()) # Hash of "shared-prefix:path-B"
# Original base object is unaffected
Hashing a large file efficiently
import hashlib
def hash_file(filepath, algorithm="sha256", chunk_size=65536):
    """Hash a file without loading it entirely into memory."""
    h = hashlib.new(algorithm)
    with open(filepath, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
checksum = hash_file("/path/to/large_file.iso")
print(checksum)
hashlib.new(algorithm_name) is the most flexible way to create a hash object when the algorithm name is a variable rather than a hard-coded choice. It accepts any algorithm name supported by your OpenSSL installation, and it's exactly the kind of interchangeable usage PEP 247 was designed to enable.
Writing algorithm-agnostic code
import hashlib
def compute_digest(data: bytes, hash_constructor) -> str:
    """
    Works with any hash constructor: a hashlib function such as
    hashlib.sha256, or the new() function of a PEP 247-compliant module.
    The caller decides the algorithm; this function doesn't care.
    """
    h = hash_constructor(data)
    return h.hexdigest()

# Swap algorithms with zero changes to compute_digest
print(compute_digest(b"test data", hashlib.sha256))
print(compute_digest(b"test data", hashlib.sha512))
Design Decisions Worth Understanding
Several choices in PEP 247 might look arbitrary at first, but each had a concrete reason behind it.
Why bytes, not bits, for digest_size? The PEP's author surveyed real code and found that the byte length was needed far more often than the bit length — for buffer allocation, file seeking, and string length calculations. Bit counts are what you see in algorithm names (SHA-256, AES-128), but once you're writing code, you almost always need bytes. The PEP puts the convenience where it's used, leaving the multiplication by 8 as a rare task for those who genuinely need it.
Why keep the method named update() instead of append()? Some reviewers suggested append() would be more intuitive. The PEP rejected this in favor of update(), which already appeared in Python's built-in md5 and sha modules. Consistency with existing code won out over a slightly better name.
Why does key come before string in keyed constructors? key is a required parameter for keyed hashes, and Python convention places required parameters before optional ones. This means the position of string shifts from first (in unkeyed hashes) to second (in keyed hashes). The PEP acknowledged this could cause confusion — someone might pass a string to a keyed hash thinking they're hashing it, when really they're supplying it as the key — but concluded this edge case wasn't common enough to warrant a more confusing interface overall.
Why does digest() not alter object state? This is a deliberate choice for usability. If calling digest() consumed or reset the hash state, you could only check the hash at the very end. By making it non-destructive, the API lets you checkpoint a hash at any point during a streaming operation, which is useful for progress reporting, partial verification, and the branching pattern that copy() enables.
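That checkpointing pattern looks like this in practice (a sketch, with made-up chunk data):

```python
import hashlib

h = hashlib.sha256()
for i, chunk in enumerate([b"chunk-0", b"chunk-1", b"chunk-2"], start=1):
    h.update(chunk)
    # Non-destructive digest(): report progress without disturbing the hash
    print(f"after {i} chunks: {h.hexdigest()[:12]}...")

# The final digest matches hashing everything in one shot
assert h.digest() == hashlib.sha256(b"chunk-0chunk-1chunk-2").digest()
```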
PEP 247 Today: hashlib and Beyond
When Python 2.5 introduced hashlib in 2006, it superseded the older built-in md5 and sha modules. The hashlib design is directly informed by PEP 247: new(), update(), digest(), hexdigest(), copy(), and digest_size are all present and behave exactly as the PEP specified.
Third-party cryptography libraries that target Python, including PyCryptodome and the cryptography package, also honor the PEP 247 contract for their hash implementations. This means code that was written to the interface — rather than to any specific library — has remained portable for over two decades.
Python 3 made one notable surface change: the string parameter that PEP 247 described has become a bytes parameter, reflecting the strict text/binary split that Python 3 enforces. The underlying contract is identical; only the expected input type has evolved with the language.
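In concrete terms, Python 3 rejects str input outright:

```python
import hashlib

hashlib.sha256(b"bytes are fine")  # OK

try:
    hashlib.sha256("plain str is rejected")  # Python 3 raises TypeError
except TypeError as exc:
    print(exc)
```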
Starting in Python 3.9, hashlib also gained support for passing a usedforsecurity keyword argument to new(). This lets you use algorithms like MD5 in non-security contexts (log file checksums, hash table seeds) on systems where OpenSSL restricts their use in FIPS mode. This is the kind of optional extension PEP 247 explicitly anticipated when it noted that additional keyword arguments with sensible defaults were permitted.
import hashlib
# Non-security use of MD5 on FIPS-compliant systems
h = hashlib.new("md5", usedforsecurity=False)
h.update(b"cache key data")
print(h.hexdigest())
Key Takeaways
- PEP 247 is an interface contract, not an implementation. It defines what every compliant hash module must expose: a new() constructor, a module-level digest_size, and four object methods (update, digest, hexdigest, copy). Any module that follows this contract can be swapped in for another with minimal code changes.
- digest_size is in bytes. Always multiply by 8 if you need the bit count for display or documentation. The byte count is what you'll use in actual code.
- digest() and hexdigest() are non-destructive. You can call them at any point during a hashing operation and continue feeding data with update() afterward. This enables checkpointing and intermediate verification.
- copy() enables efficient branching. Hash a common prefix once, then fork the object to hash different suffixes without redundant computation. This is a practical pattern for Merkle trees, incremental integrity checks, and similar use cases.
- hashlib was built on this foundation. Python's standard hash library honors the PEP 247 contract, which is why code written against that interface has stayed portable across Python versions, operating systems, and library versions for more than two decades.
PEP 247 is a small document with an outsized impact. By settling on a common API before the ecosystem fragmented further, it gave Python's cryptography tooling a coherent shape. Whether you're computing a file checksum, building a content-addressable store, or implementing a custom protocol, the interface you reach for in hashlib today carries the DNA of this 2001 specification.