defaultdict vs dict: What Every Python Developer Should Understand

Python’s built-in dict is one of the most heavily used data structures in the language. But lurking in the collections module is its specialized subclass, defaultdict, and the difference between these two goes far deeper than “one handles missing keys.”

Understanding when and why to reach for each one requires knowing their history, their internal mechanics, their performance characteristics, and the subtle behavioral gotchas that trip up even experienced developers. This article examines both data structures from the inside out: how they work at the C level, what the real performance differences are, when defaultdict can actually hurt you, and how related PEPs have shaped both types over Python’s evolution.

Origins: How defaultdict Came to Exist

The defaultdict class was introduced in Python 2.5, released in September 2006. Its arrival was the result of a long-running discussion on the python-dev mailing list in February 2006, where core developers debated how to handle the extremely common pattern of initializing dictionary values before first use. The discussion centered on a defaultdict proposal and the related __missing__ protocol.

Before defaultdict, every Python developer who needed to group, count, or accumulate values into a dictionary had to write defensive boilerplate code:

# The pre-defaultdict world: counting word frequencies
word_counts = {}
for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

Or, using the slightly more elegant dict.setdefault():

groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)

Both patterns work, but they require the programmer to think about initialization logic on every access. The defaultdict eliminated this friction by moving the default-value logic into the data structure itself:

from collections import defaultdict

word_counts = defaultdict(int)
for word in words:
    word_counts[word] += 1

The Python 2.5 “What’s New” documentation, authored by A.M. Kuchling, introduced defaultdict alongside other significant additions like the with statement (PEP 343) and conditional expressions (PEP 308). The collections module was explicitly described in the official documentation as providing “high-performance container datatypes,” and defaultdict was a centerpiece of that promise.

How They Actually Work: The __missing__ Protocol

At the core of defaultdict’s behavior is the __missing__ dunder method, a special method that dict.__getitem__ calls when a key lookup fails. This protocol was added to the built-in dict type in Python 2.5 specifically to support defaultdict and custom dictionary subclasses.

Here is the sequence of events when you access a missing key on a defaultdict:

  1. You write d[key], which calls d.__getitem__(key).
  2. The key is not found in the hash table.
  3. Instead of raising KeyError, __getitem__ calls d.__missing__(key).
  4. defaultdict.__missing__ checks if default_factory is None. If it is, it raises KeyError just like a regular dict.
  5. If default_factory is set, it calls self.default_factory() with no arguments.
  6. The returned value is inserted into the dictionary under the missing key.
  7. The value is returned to the caller.
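The sequence can be observed directly by subclassing defaultdict and instrumenting __missing__. This is a sketch; TracingDefaultDict is a hypothetical name for illustration:

```python
from collections import defaultdict

class TracingDefaultDict(defaultdict):
    """Logs every __missing__ invocation to make the protocol visible."""
    def __missing__(self, key):
        print(f"__missing__ called for {key!r}")
        return super().__missing__(key)  # runs steps 4-7 above

d = TracingDefaultDict(list)
d["x"].append(1)   # key missing: __missing__ fires, inserts [], then appends
d["x"].append(2)   # key present: plain hash lookup, __missing__ is skipped
```

Running this prints the trace message exactly once: only the first access takes the __missing__ path.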

Critical Distinction

Accessing a missing key on a defaultdict mutates the dictionary. It does not merely return a default value — it creates the key-value pair. This stands in contrast to dict.get(), which returns a default without modifying anything.

from collections import defaultdict

d = defaultdict(list)
result = d["nonexistent"]   # d now contains {"nonexistent": []}
print(len(d))               # 1, not 0

regular = {}
result = regular.get("nonexistent", [])  # regular is still empty
print(len(regular))                       # 0

You can also use the __missing__ protocol yourself with plain dict subclasses, without importing anything from collections:

class UpperDict(dict):
    def __missing__(self, key):
        upper_key = key.upper()
        if upper_key in self:
            return self[upper_key]
        raise KeyError(key)

d = UpperDict({"HELLO": "world"})
print(d["hello"])  # "world" -- found via __missing__

This flexibility is by design. As the official Python documentation for collections states, defaultdict “overrides one method and adds one writable instance variable. The remaining functionality is the same as for the dict class.”

The Five Approaches to Missing Keys (And When to Use Each)

Understanding defaultdict in isolation is not enough. It exists within an ecosystem of strategies for handling missing dictionary keys, and choosing the right one depends on whether you want mutation, what your default value is, and how frequently you encounter missing keys.

1. LBYL: Look Before You Leap

if key in d:
    value = d[key]
else:
    value = default

Two lookups (the in check and the retrieval) mean redundant work when the key exists, and the pattern is verbose. Use it only when you need to take entirely different actions depending on key presence.

2. EAFP: Try/Except

try:
    value = d[key]
except KeyError:
    value = default

Efficient when keys are usually present, because the try block adds zero overhead on the success path in CPython. Expensive when keys are frequently missing, because exception handling in Python is costly. Guido van Rossum’s long-standing guidance, reflected in the Python glossary’s entry for EAFP, is that this pattern is idiomatic for Python.

3. dict.get()

value = d.get(key, default)

Returns default (or None if omitted) without modifying the dictionary. Ideal for read-only access where you need a fallback. Cannot be used when you need to modify the returned value in place, because the returned default is not stored in the dict.
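A quick sketch of that limitation: the default returned by get() is a temporary object that never enters the dictionary, so in-place mutation is silently lost:

```python
d = {}
d.get("scores", []).append(98)  # appends to a fresh list, then discards it
print(d)                # {}  -- the dictionary was never modified
print("scores" in d)    # False
```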

4. dict.setdefault()

value = d.setdefault(key, default)

Returns the value for key if it exists; otherwise inserts key with default and returns default. Unlike get(), this does mutate the dictionary. The downside is that, because Python evaluates all function arguments before the call, the default value is constructed on every invocation, even when the key already exists. For expensive defaults like list(), this means allocating an empty list on every call, then immediately discarding it when the key is found.
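That eager evaluation is easy to demonstrate with a factory that counts its own invocations (make_list and calls are illustrative names, standing in for an expensive default constructor):

```python
calls = 0

def make_list():
    """Stands in for an expensive default constructor."""
    global calls
    calls += 1
    return []

d = {"a": [1, 2]}
d.setdefault("a", make_list())  # "a" already exists, but make_list() still runs
d.setdefault("a", make_list())  # ...and runs again
print(calls)  # 2 -- two lists were built and immediately discarded
```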

5. defaultdict

d = defaultdict(list)
d[key].append(value)

The default factory is called only when the key is missing, avoiding the wasted allocation problem of setdefault(). The factory is defined once at construction time, not repeated at every access site. This is its primary advantage in both performance and readability.
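The contrast with setdefault() shows up when you count factory invocations (counting_list is an illustrative name):

```python
from collections import defaultdict

factory_calls = 0

def counting_list():
    """A list factory that records how often it is invoked."""
    global factory_calls
    factory_calls += 1
    return []

d = defaultdict(counting_list)
d["a"].append(1)  # key missing: factory runs once
d["a"].append(2)  # key present: factory is not called again
print(factory_calls)  # 1
```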

Performance: Real Benchmarks, Real Differences

The performance difference between these approaches is measurable, and it matters in tight loops over large datasets. Here is a controlled benchmark:

import timeit

setup = """
from collections import defaultdict
import random
random.seed(42)
keys = [random.randint(0, 999) for _ in range(1_000_000)]
"""

# Approach 1: if/else with dict
if_else_code = """
d = {}
for k in keys:
    if k in d:
        d[k] += 1
    else:
        d[k] = 1
"""

# Approach 2: dict.get()
get_code = """
d = {}
for k in keys:
    d[k] = d.get(k, 0) + 1
"""

# Approach 3: dict.setdefault()
setdefault_code = """
d = {}
for k in keys:
    d.setdefault(k, 0)
    d[k] += 1
"""

# Approach 4: defaultdict(int)
defaultdict_code = """
d = defaultdict(int)
for k in keys:
    d[k] += 1
"""

# Approach 5: try/except
try_except_code = """
d = {}
for k in keys:
    try:
        d[k] += 1
    except KeyError:
        d[k] = 1
"""

for name, code in [("if/else", if_else_code),
                    ("dict.get()", get_code),
                    ("setdefault()", setdefault_code),
                    ("defaultdict", defaultdict_code),
                    ("try/except", try_except_code)]:
    t = timeit.timeit(code, setup=setup, number=50)
    print(f"{name:15s}: {t:.3f}s")

Typical results on CPython 3.12 (relative order is consistent across machines):

if/else        : 3.912s
dict.get()     : 4.187s
setdefault()   : 5.043s
defaultdict    : 3.104s
try/except     : 3.298s

The defaultdict tends to win for counting and grouping workloads, primarily because d[k] += 1 on a defaultdict resolves to a single __getitem__ call that either finds the key (fast hash table lookup) or calls the C-level __missing__ implementation. There is no Python-level branching, no second lookup, and no wasted object allocation. The margin varies by hardware and Python version; try/except is competitive when keys are overwhelmingly present, and dict.get() can close the gap in some configurations.

Pro Tip

The belief that defaultdict is categorically faster than setdefault() is an oversimplification. When keys already exist (the “hot path” in a pre-populated dictionary), the factory never runs and defaultdict indexing is just ordinary dict indexing, so the gap narrows; the advantage comes entirely from the missing-key path. Profile your specific workload before optimizing.

The Gotchas: When defaultdict Bites You

Gotcha 1: Silent Key Creation Breaks Containment Checks

from collections import defaultdict

config = defaultdict(str)
config["host"] = "localhost"
config["port"] = "5432"

# Later, someone checks for an optional key:
if config["timeout"]:
    print("Timeout is set")

# Now config has a "timeout" key with value "", and len(config) is 3

Warning

Checking for key existence with d[key] instead of key in d silently pollutes the dictionary. For configuration objects, API responses, or any data structure where the set of keys is semantically meaningful, a regular dict with explicit .get() is safer.
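Non-mutating alternatives for the same check, as a sketch: both the in operator and .get() bypass __missing__, since the protocol is only invoked by __getitem__:

```python
from collections import defaultdict

config = defaultdict(str)
config["host"] = "localhost"

# Membership test: never creates the key
if "timeout" in config:
    print("Timeout is set")

# Read with fallback: never creates the key either
timeout = config.get("timeout", "")

print(len(config))  # still 1 -- only "host" exists
```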

Gotcha 2: JSON Serialization Surprise

import json
from collections import defaultdict

d = defaultdict(list)
d["users"].append("alice")

# This works, but...
json.dumps(d)  # '{"users": ["alice"]}'

# The type information is lost. If you deserialize, you get a regular dict.
loaded = json.loads(json.dumps(d))
type(loaded)  # <class 'dict'>, not defaultdict

Round-tripping through JSON loses the default_factory, which can cause KeyError exceptions in code that expects the defaultdict behavior to survive serialization.
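One way to restore the behavior after deserialization is to rewrap the plain dict, passing the factory back in explicitly (a sketch; defaultdict accepts a mapping as its second argument):

```python
import json
from collections import defaultdict

d = defaultdict(list)
d["users"].append("alice")

payload = json.dumps(d)
restored = defaultdict(list, json.loads(payload))  # reattach the factory

restored["admins"].append("bob")   # missing-key behavior works again
print(type(restored).__name__)     # defaultdict
```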

Gotcha 3: default_factory Is Not Picklable with Lambda

import pickle
from collections import defaultdict

d = defaultdict(lambda: "N/A")
d["name"] = "Alice"

pickle.dumps(d)  # PicklingError: Can't pickle <lambda>

If you need to serialize a defaultdict, use a named function or a built-in type like int, list, or str as the factory. Lambdas are not picklable in standard CPython.
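A module-level named function makes the same defaultdict picklable, because pickle serializes functions by qualified name (default_na is an illustrative name):

```python
import pickle
from collections import defaultdict

def default_na():
    """Module-level factory: picklable by reference, unlike a lambda."""
    return "N/A"

d = defaultdict(default_na)
d["name"] = "Alice"

restored = pickle.loads(pickle.dumps(d))  # round-trips cleanly
print(restored["missing"])  # "N/A" -- the factory survived
```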

Gotcha 4: The repr Reveals Your Implementation

from collections import defaultdict

d = defaultdict(int, {"a": 1, "b": 2})
print(d)  # defaultdict(<class 'int'>, {'a': 1, 'b': 2})

If you are returning dictionaries from a public API, a defaultdict’s repr leaks implementation details. Converting to a plain dict with dict(d) before returning is a common pattern.

Gotcha 5: Mutation on Read Is Not Thread-Safe

Neither dict nor defaultdict is thread-safe without external synchronization, but defaultdict’s mutation-on-read behavior makes it particularly hazardous in concurrent code. A seemingly innocent read like d[key] triggers a write when the key is missing, which can produce race conditions, corrupt internal state, or raise RuntimeError if another thread is iterating over the dictionary at the same time. In multi-threaded code, protect defaultdict access with a threading.Lock, or populate the dictionary in a single thread before sharing it.
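A minimal locking sketch for concurrent counting (record and counts are illustrative names):

```python
import threading
from collections import defaultdict

counts = defaultdict(int)
lock = threading.Lock()

def record(key):
    # The lock covers the whole read-modify-write, including the hidden
    # insert that a missing key would trigger.
    with lock:
        counts[key] += 1

threads = [threading.Thread(target=record, args=("hits",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts["hits"])  # 8
```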

Related PEPs: The Evolution of dict and defaultdict

Several PEPs have shaped how dict and defaultdict behave in modern Python. Understanding these helps you write code that takes advantage of the latest capabilities.

PEP 584 — Add Union Operators To dict (Python 3.9)

PEP 584, authored by Steven D’Aprano and Brandt Bucher and accepted by Guido van Rossum, added the | (merge) and |= (update) operators to dict. For most dict subclasses, | still returns a plain dict, but defaultdict overrides the operator: merging two defaultdict objects with | returns a defaultdict, preserving the default_factory of the left operand. This is an improvement over the {**d1, **d2} unpacking syntax, which always returns a plain dict and loses subclass behavior.

from collections import defaultdict

d1 = defaultdict(int, {"a": 1, "b": 2})
d2 = defaultdict(int, {"b": 3, "c": 4})

merged = d1 | d2
print(type(merged))  # <class 'collections.defaultdict'>
print(merged)        # defaultdict(<class 'int'>, {'a': 1, 'b': 3, 'c': 4})

PEP 3119 — Introducing Abstract Base Classes (Python 3.0)

PEP 3119, authored by Guido van Rossum and Talin, formalized the Abstract Base Classes framework that underpins the collections.abc module. This PEP established MutableMapping as the abstract base class for dictionary-like objects, which both dict and defaultdict satisfy. If you write functions that accept any mapping type, type-hinting with MutableMapping rather than dict ensures your code works with defaultdict, OrderedDict, and custom dict subclasses:

from collections.abc import MutableMapping

def process_data(mapping: MutableMapping[str, list]) -> None:
    for key in mapping:
        mapping[key].sort()

PEP 372 — Adding an Ordered Dictionary to collections (Python 3.1)

PEP 372 introduced OrderedDict, which, like defaultdict, is a dict subclass in collections. While insertion order became an implementation detail of CPython’s dict in Python 3.6 (and was then formalized as a language guarantee in Python 3.7), OrderedDict demonstrates the same pattern as defaultdict: specialized subclasses that override specific behaviors while inheriting the core hash table implementation.

PEP 455 — Adding a key-transforming dictionary to collections (Rejected)

PEP 455, proposed by Antoine Pitrou, attempted to add a TransformDict to collections that would normalize keys on insertion and lookup (for example, case-insensitive dictionaries). While ultimately rejected by BDFL-Delegate Raymond Hettinger, the discussion it generated is directly relevant to defaultdict because it explored the design boundaries of dict subclassing: how much behavior modification is appropriate in a subclass versus requiring an entirely separate type. The decision reinforced the principle that collections should contain broadly useful containers rather than specialized variants.

Building Your Own: Custom __missing__ Without defaultdict

One of the most powerful and underutilized patterns in Python is implementing __missing__ on a plain dict subclass. This gives you all the auto-initialization power of defaultdict with complete control over the logic:

class CounterDict(dict):
    """A dict that auto-initializes missing keys to 0."""
    def __missing__(self, key):
        self[key] = 0
        return 0

class NestedDict(dict):
    """A dict that auto-creates nested dicts on access."""
    def __missing__(self, key):
        value = self[key] = NestedDict()
        return value

# Usage: infinite nesting without any setup
config = NestedDict()
config["database"]["primary"]["host"] = "localhost"
config["database"]["primary"]["port"] = 5432
print(config)
# {'database': {'primary': {'host': 'localhost', 'port': 5432}}}

The NestedDict pattern (sometimes called an “autovivifying dictionary,” borrowing terminology from Perl) cannot be cleanly expressed with defaultdict alone, because defaultdict(defaultdict(dict)) does not work — defaultdict’s first argument must be a callable, and defaultdict(dict) is an instance, not a callable. The correct defaultdict equivalent requires a recursive named function:

from collections import defaultdict

def nested_dict():
    return defaultdict(nested_dict)

config = nested_dict()
config["database"]["primary"]["host"] = "localhost"

Both approaches work, but the __missing__ subclass is more explicit and easier to extend with validation, logging, or type constraints.

The Decision Framework

Use a regular dict when your dictionary has a fixed or well-known set of keys, you are building configuration objects or API responses where the key set is semantically meaningful, you need to serialize the dictionary (JSON, pickle) and want predictable behavior, or you want read-only default access via .get() without mutating the dictionary.

Use defaultdict when you are grouping items into lists (defaultdict(list)), collecting unique items (defaultdict(set)), building adjacency lists for graph algorithms, or processing data in a single pass where every key will be written to at least once. For simple counting, consider collections.Counter first — it is a dict subclass purpose-built for tallying (available since Python 2.7 / 3.1) and provides conveniences like most_common() and arithmetic operators. When Counter does not fit — for example, when accumulating sums rather than counts — defaultdict(int) is the right tool. In these cases, defaultdict produces cleaner code and measurably better performance.

Use a custom dict subclass with __missing__ when your default-value logic depends on the key itself (not just a fixed factory), you need nested auto-initialization, or you want to add validation or side effects to the missing-key path.
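A sketch of the key-dependent case, which defaultdict's zero-argument factory cannot express (SquareCache is an illustrative name):

```python
class SquareCache(dict):
    """Caches squares on demand: the default is computed from the key itself."""
    def __missing__(self, key):
        value = self[key] = key * key  # derived from the key, then stored
        return value

squares = SquareCache()
print(squares[12])   # 144, computed and cached on first access
print(squares)       # {12: 144}
```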

Key Takeaways

  1. defaultdict mutates on read; dict.get() does not: This is the single most important behavioral difference and the one most likely to cause subtle bugs in production code.
  2. The __missing__ protocol is more general than defaultdict: You can implement it on any dict subclass for full control over missing-key logic without importing from collections.
  3. defaultdict wins the performance race for new-key-heavy workloads: Grouping and accumulation are its sweet spots. For pure counting, consider collections.Counter first. For pre-populated dictionaries, the advantage shrinks or disappears.
  4. Lambdas as default_factory break pickling: Use named functions or built-in callables if your defaultdict needs to cross process boundaries.
  5. Mutation on read is not thread-safe: In concurrent code, defaultdict’s silent writes on missing-key access can cause race conditions. Protect access with a lock or populate before sharing.
  6. Use MutableMapping for type hints, not dict: Code that accepts MutableMapping works with defaultdict, OrderedDict, and custom subclasses without modification.

The dict and defaultdict are not competing types. They are complementary tools that solve different problems. A regular dict is the right default for general-purpose key-value storage; a defaultdict is a purpose-built optimization for workloads where missing-key handling is the primary operation. Those behavioral differences — not performance benchmarks — should drive your choice. When they align with your use case, defaultdict produces code that is shorter, faster, and harder to get wrong. When they do not, a plain dict with explicit handling remains the better tool.
