Partition Columns in Python: The Complete Guide to Writing and Reading Partitioned Datasets

When your datasets grow past a few hundred megabytes, the way you store them on disk starts to matter -- a lot. Naively dumping everything into a single CSV or even a single Parquet file means every query has to scan the entire dataset, regardless of whether you only need last Tuesday's records or sales from a single region. Partition columns solve this by physically splitting your data into a directory structure based on column values, so your code only reads what it actually needs.

This is not a surface-level overview. We are going to build genuine understanding of what partition columns are, how they work at the filesystem level, how to use them across pandas, PyArrow, PySpark, Polars, and DuckDB, what trade-offs to watch for, and a bit of the history that shaped this landscape.

What Partitioning Actually Does at the Filesystem Level

Partitioning a dataset by column means writing data into a hierarchy of directories, where each directory name encodes the column name and its value. This convention originated in Apache Hive and is commonly called Hive-style partitioning. The directory names follow a key=value pattern. For a dataset partitioned by region, the layout looks like:

dataset_root/
├── region=APAC/
│   └── part-0.parquet
├── region=EU/
│   └── part-0.parquet
└── region=US/
    └── part-0.parquet

The critical insight is that the region column is not stored inside the Parquet files themselves. Instead, the value is encoded in the directory path. When you read the dataset back, the engine reconstructs the column from the path. This means two things happen simultaneously: you get smaller individual files (since the partition column data is stripped out), and query engines can skip entire directories that do not match your filter condition. This optimization is called partition pruning, and it is the primary reason partitioning exists.

Apache Parquet itself was born from a collaboration between Twitter and Cloudera, first released in July 2013. The project was prototyped by Julien LeDem, then tech lead for Twitter's data processing tools, who wrote in a 2024 retrospective that he was looking for an anagram of Dremel to name the project: "Emerald was already taken, so I picked Red Elm." The format drew heavily on Google's Dremel paper's record-shredding and assembly algorithm. As the Apache Parquet Wikipedia article notes, the project "was designed as an improvement on the Trevni columnar storage format created by Doug Cutting, the creator of Hadoop," and the name was chosen to "evoke the bottom layer of a database with an interesting layout." In April 2015, Parquet graduated from the Apache Incubator to become a top-level Apache Software Foundation project. Hive, Impala, Drill, and Spark all adopted it as a shared standard for high-performance data IO.

Partitioning with Pandas

Pandas has supported writing to Parquet since version 0.21.0, but the partition_cols parameter specifically was added in version 0.24.0 (released January 2019). The parameter accepts a list of column names, and the columns are partitioned in the order they are given.

Here is a complete, working example:

import pandas as pd
import numpy as np

# Build a realistic dataset
np.random.seed(42)
n = 10000
df = pd.DataFrame({
    'order_date': pd.date_range('2024-01-01', periods=n, freq='h'),
    'region': np.random.choice(['US', 'EU', 'APAC'], size=n),
    'product': np.random.choice(['Widget', 'Gadget', 'Gizmo'], size=n),
    'revenue': np.random.uniform(10, 500, size=n).round(2)
})

# Extract year and month for partitioning
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month

# Write with partition columns
df.to_parquet(
    'sales_partitioned',
    partition_cols=['year', 'month', 'region'],
    engine='pyarrow',
    index=False
)

After execution, your filesystem looks like this:

sales_partitioned/
├── year=2024/
│   ├── month=1/
│   │   ├── region=APAC/
│   │   │   └── <uuid>.parquet
│   │   ├── region=EU/
│   │   │   └── <uuid>.parquet
│   │   └── region=US/
│   │       └── <uuid>.parquet
│   ├── month=2/
│   │   └── ...
│   └── ...
└── year=2025/
    └── ...

Reading back is where the real payoff occurs. When you apply filters, the engine skips directories entirely:

# Only reads files from the year=2024/month=3/region=US directory
df_filtered = pd.read_parquet(
    'sales_partitioned',
    filters=[
        ('year', '=', 2024),
        ('month', '=', 3),
        ('region', '=', 'US')
    ]
)

Common Trap

partition_cols must receive a list, even for a single column. Passing a bare string like partition_cols='region' will iterate over the characters of the string, producing a cryptic KeyError: 'r'. This exact issue was documented in GitHub issue #27117 on the pandas repository in June 2019. The fix was merged for pandas 1.0, but the lesson remains: always wrap your partition column in a list.

# Wrong -- will raise a confusing KeyError
df.to_parquet('output', partition_cols='region')

# Correct
df.to_parquet('output', partition_cols=['region'])

PyArrow: More Control, Same Foundation

Under the hood, pandas delegates partitioned writes to PyArrow (or fastparquet). Working with PyArrow directly gives you finer-grained control, especially for larger datasets where you want to tune compression or manage row group sizes.

import pyarrow as pa
import pyarrow.parquet as pq

# Convert to Arrow table
table = pa.Table.from_pandas(df)

# Write partitioned dataset with explicit control
pq.write_to_dataset(
    table,
    root_path='sales_arrow_partitioned',
    partition_cols=['year', 'month'],
    compression='snappy',
    use_dictionary=True
)

PyArrow also supports two distinct partitioning schemes when reading. Hive partitioning expects the key=value directory names. Directory partitioning is a simpler scheme where directory names are just the values, without the key prefix (like /2024/01/15/ instead of /year=2024/month=01/day=15/). You configure this through the pyarrow.dataset module:

import pyarrow.dataset as ds

# Read Hive-style partitioned data
dataset = ds.dataset(
    'sales_arrow_partitioned',
    format='parquet',
    partitioning='hive'
)

# Apply predicate pushdown -- only reads relevant partitions
table = dataset.to_table(
    filter=(ds.field('year') == 2024) & (ds.field('month') == 6)
)
df_result = table.to_pandas()

Type Behavior on Load

The PyArrow documentation states that partition columns in the original table will have their types converted to Arrow dictionary types (pandas categorical) on load. This means that if you wrote a string column as your partition key, it comes back as a Categorical in pandas. This is usually beneficial for memory, but can surprise you if downstream code expects plain strings.

PySpark: Partitioning at Scale

In distributed computing with PySpark, partitioning becomes even more critical. PySpark distinguishes between two concepts that share the same name but operate differently: in-memory partitioning (how a DataFrame is split across executors for parallel processing) and disk partitioning (how data is organized when written to storage). The partitionBy() method on DataFrameWriter handles the latter.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionExample').getOrCreate()

# Load or create a DataFrame
df_spark = spark.createDataFrame(df.drop(columns=['year', 'month']))

# Add partition columns
from pyspark.sql.functions import year as spark_year, month as spark_month, col
df_spark = df_spark.withColumn('year', spark_year(col('order_date'))) \
                   .withColumn('month', spark_month(col('order_date')))

# Write partitioned to Parquet
df_spark.write \
    .partitionBy('year', 'month', 'region') \
    .mode('overwrite') \
    .parquet('sales_spark_partitioned')

# Read with predicate pushdown
df_read = spark.read.parquet('sales_spark_partitioned') \
    .filter((col('year') == 2024) & (col('month') == 3) & (col('region') == 'US'))

Pro Tip

In PySpark, mode('overwrite') at the DataFrameWriter level overwrites the entire output directory by default. To overwrite only the specific partitions being written (leaving other partitions intact), set spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") before writing.

Polars and DuckDB

Two newer tools in the Python data ecosystem have first-class support for partitioned datasets.

Polars

Polars uses write_parquet() for eager frames with pyarrow_options={'partition_cols': [...]} to write Hive-style partitioned datasets. For lazy frames, the newer sink_parquet() with pl.PartitionByKey() provides a native lazy path -- though the Polars documentation marks Hive partitioning in sink_parquet as unstable as of early 2026, meaning the API may evolve. The eager path via write_parquet with PyArrow options remains the stable choice:

import polars as pl

# Eager write: use write_parquet with pyarrow_options (delegates to PyArrow internally)
df_eager = pl.read_parquet('sales_arrow_partitioned/**/*.parquet', hive_partitioning=True)

df_eager.write_parquet(
    'sales_polars_partitioned',
    use_pyarrow=True,
    pyarrow_options={'partition_cols': ['year', 'month']}
)

# Lazy write: use sink_parquet with pl.PartitionByKey for Hive-style output
# Note: native Hive partitioning in sink_parquet is marked unstable in current Polars;
# using write_parquet with pyarrow_options is the stable path as of 2026.
lf = pl.scan_parquet('sales_arrow_partitioned/**/*.parquet', hive_partitioning=True)

lf.sink_parquet(
    pl.PartitionByKey(
        'sales_polars_lazy_partitioned/',
        by=['year', 'month']
    ),
    mkdir=True
)

# Read back with Hive partition discovery and predicate pushdown
df_polars = pl.scan_parquet(
    'sales_polars_partitioned/**/*.parquet',
    hive_partitioning=True
).filter(
    (pl.col('year') == 2024) & (pl.col('month') == 6)
).collect()

DuckDB

DuckDB's SQL-first approach lets you write partitioned Parquet using the PARTITION BY clause inside COPY ... TO:

import duckdb

con = duckdb.connect()

# Register the pandas DataFrame as a DuckDB view
con.register('sales_df', df)

# Write partitioned Parquet
con.execute("""
    COPY sales_df
    TO 'sales_duckdb_partitioned'
    (FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE TRUE)
""")

# Query with automatic partition pruning
result = con.execute("""
    SELECT product, AVG(revenue) AS avg_revenue
    FROM read_parquet('sales_duckdb_partitioned/**/*.parquet', hive_partitioning=true)
    WHERE year = 2024 AND month = 6 AND region = 'EU'
    GROUP BY product
""").df()

Cross-Tool Interop Gotcha: DuckDB Writes Partition Columns Into Files

DuckDB's COPY TO ... PARTITION_BY writes partition column values both in the directory path and inside the Parquet files themselves. When you then read those files with PyArrow's Hive partitioning enabled, you end up with duplicate columns -- one reconstructed from the directory name (as a string), one from the file data (as its original type). This causes type conflicts that can silently produce unexpected results. The workaround is to either drop the partition column from your SELECT before writing (COPY (SELECT * EXCLUDE (year, month) FROM ...) TO ...) or read the files with hive_partitioning=false and handle the column yourself. This behavior was documented in DuckDB GitHub issue #12147. It is not a bug per se -- DuckDB follows a different convention than PyArrow -- but it will catch you off guard when mixing tools in the same pipeline.

Choosing Good Partition Columns

Not every column makes a good partition key. The goal is to select columns that match your most common query filters while keeping the number of distinct values (cardinality) manageable. Here are the principles:

Low-to-moderate cardinality works best. A column with 5 to a few hundred unique values (like region, year, or product category) creates a reasonable number of directories. A column with millions of unique values (like user_id or transaction_id) creates millions of directories, each containing tiny files. This destroys performance rather than improving it.

Match your query patterns. If 90% of your queries filter by date, partition by date. If you frequently filter by both date and region, partition by both, with the higher-cardinality column (date) first.

Beware of data skew. If one partition value contains 80% of the data while others contain 1% each, your partitions are unbalanced. This leads to uneven file sizes, which hurts both storage efficiency and query parallelism.

Temporal columns are the classic choice. Partitioning by year/month or year/month/day is the bread and butter of data lake design. It aligns naturally with how data arrives (incrementally over time) and how it is queried (analyzing specific time ranges).

Here is a practical function that evaluates whether a column is a good partition candidate:

def evaluate_partition_column(df, column):
    """Analyze whether a column is suitable for partitioning."""
    n_unique = df[column].nunique()
    n_rows = len(df)
    value_counts = df[column].value_counts()

    avg_rows_per_partition = n_rows / n_unique
    skew_ratio = value_counts.max() / value_counts.min()

    print(f"Column: {column}")
    print(f"  Unique values: {n_unique}")
    print(f"  Avg rows per partition: {avg_rows_per_partition:,.0f}")
    print(f"  Skew ratio (max/min): {skew_ratio:.1f}")

    if n_unique > 1000:
        print("  WARNING: High cardinality -- likely too many partitions")
    elif avg_rows_per_partition < 100:
        print("  WARNING: Very few rows per partition -- files will be tiny")
    elif skew_ratio > 10:
        print("  CAUTION: Significant skew between partitions")
    else:
        print("  GOOD: Reasonable partition candidate")

Partition column selection quick reference

Column Type           Typical Cardinality       Verdict
year                  5 - 20                    Excellent
month                 12                        Excellent
region / country      5 - 250                   Good
product category      10 - 100                  Good
date (YYYY-MM-DD)     Hundreds to thousands     Use cautiously
user_id               Millions                  Avoid
transaction_id        Millions                  Avoid
float / price         Effectively unbounded     Avoid

Multi-Level Partitioning and Column Order

When you partition by multiple columns, the order matters. The first column in the list creates the top-level directories, and subsequent columns create nested subdirectories. This hierarchy determines which filters trigger pruning at each level.

# Order 1: year -> region
# Good when queries always filter by year first
df.to_parquet('data', partition_cols=['year', 'region'])

# Order 2: region -> year
# Good when queries always filter by region first
df.to_parquet('data', partition_cols=['region', 'year'])

In practice, put the column you filter on most frequently at the top level. If you always query a specific year first and then optionally filter by region, put year first. The reasoning is that the top-level directory structure is what gets pruned first during file listing operations.

Type Conversion Gotchas

Partition columns undergo type transformations during the write-read cycle that can catch you off guard.

When writing, the partition column values become directory names -- meaning they are converted to strings. When reading, the engine infers types back from those directory strings. This round-trip does not always preserve the original type perfectly. Dates become strings, floats can lose precision, and None values get encoded as a special directory name (PyArrow uses __HIVE_DEFAULT_PARTITION__ by default).

import pandas as pd

df = pd.DataFrame({
    'value': [1, 2, 3],
    'category': pd.Categorical(['A', 'B', 'A']),
    'score': [1.5, 2.7, 3.14159]
})

# Write partitioned by score -- float as directory name
df.to_parquet('float_partition', partition_cols=['score'])

# Read back
df_back = pd.read_parquet('float_partition')
print(df_back['score'].dtype)  # Now categorical, not float64

Best Practice

Use integer or string columns as partition keys. If you need to partition by a date, extract year/month/day as integer columns and partition by those. This avoids the float-to-string-to-categorical round-trip problem entirely.

Performance: When Partitioning Helps and When It Hurts

Partitioning is not free. Every partition directory adds filesystem overhead, and the engine must list and traverse directories before reading. For small datasets (under a few hundred MB), partitioning often makes things slower because the overhead of directory traversal outweighs any scanning savings.

The sweet spot emerges with datasets in the multi-gigabyte range where queries typically touch a small fraction of the data. Wes McKinney, creator of pandas and co-creator of Apache Arrow, has written extensively about Parquet performance, demonstrating read throughput exceeding 4 GB/s with multithreaded column reads on low-entropy data. That throughput depends on efficient data layout, and partitioning is a key component of that layout for large datasets.

The Apache Arrow project, which McKinney co-founded, describes itself as an ideal in-memory transport layer for data that is being read or written with Parquet files. The synergy between Arrow's in-memory columnar format and Parquet's on-disk columnar format is what makes partition pruning so effective: the engine identifies relevant partitions from directory names without touching the data, then reads only the necessary files directly into Arrow's columnar memory layout.

Putting It All Together

Here is a complete pipeline that demonstrates writing, reading, and querying a partitioned dataset with proper type handling:

import pandas as pd
import pyarrow.dataset as ds
import numpy as np

# Generate sample data
np.random.seed(0)
dates = pd.date_range('2023-01-01', '2024-12-31', freq='h')
df = pd.DataFrame({
    'timestamp': np.random.choice(dates, 50000),
    'region': np.random.choice(['NA', 'EU', 'APAC', 'LATAM'], 50000),
    'channel': np.random.choice(['web', 'mobile', 'api'], 50000),
    'latency_ms': np.random.exponential(200, 50000).round(1),
    'success': np.random.choice([True, False], 50000, p=[0.95, 0.05])
})

# Prepare partition columns as clean integers
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month

# Write partitioned dataset
df.to_parquet(
    'api_metrics',
    partition_cols=['year', 'month', 'region'],
    engine='pyarrow',
    compression='snappy',
    index=False
)

# Read with predicate pushdown using PyArrow datasets
dataset = ds.dataset('api_metrics', format='parquet', partitioning='hive')

# Only reads files matching the filter -- skips everything else
table = dataset.to_table(
    filter=(
        (ds.field('year') == 2024) &
        (ds.field('month') == 6) &
        (ds.field('region') == 'NA')
    ),
    columns=['timestamp', 'channel', 'latency_ms', 'success']
)

result = table.to_pandas()
print(f"Read {len(result)} rows from targeted partition")
print(f"Mean latency: {result['latency_ms'].mean():.1f}ms")
print(f"Success rate: {result['success'].mean():.1%}")

Key Takeaways

  1. Partitioning splits data into a directory hierarchy that enables query engines to skip irrelevant files entirely -- this is called partition pruning and it is the whole point.
  2. The technique works across the Python ecosystem. Pandas, PyArrow, PySpark, Polars, and DuckDB all support Hive-style partitioning with consistent key=value directory naming.
  3. Always pass partition_cols as a list in pandas, even for a single column. Passing a bare string iterates over individual characters and produces a confusing KeyError.
  4. Choose partition columns based on query patterns and cardinality. Low-to-moderate cardinality columns that align with your most common filters are the right choice. High-cardinality columns like user_id create too many tiny files and hurt performance.
  5. Column order in multi-level partitioning matters. Put the column you filter on most frequently at the top level -- it is pruned first during directory listing operations.
  6. Type conversion happens during the write-read cycle. Partition column values are serialized as directory name strings and may come back as categoricals. Use integer or string columns as partition keys to keep this predictable.
  7. Partitioning adds overhead that only pays off at scale. For datasets under a few hundred MB, a single file is often faster. The benefits emerge clearly in the multi-gigabyte range with selective queries.

The Parquet format and its partitioning conventions represent a design that has been refined across an ecosystem of projects over more than a decade. Understanding how it works at the filesystem level -- not just which function to call -- is what separates developers who can debug data pipelines from those who simply copy examples. That is the kind of comprehension that makes a difference.