Understanding Polars in Python: The DataFrame Library That Rewrites the Rules

In 2020, a Dutch engineer named Ritchie Vink was stuck at home during COVID lockdowns, frustrated with Pandas. He started building a DataFrame library in Rust as a side project. Five years later, that project has surpassed 250 million total downloads, grown to over 23 million monthly users, and has raised approximately $25 million in venture funding across two rounds. Companies like Netflix, Microsoft, and G-Research run it in production.

Polars is not a Pandas wrapper. It is not a thin skin over NumPy. It is a from-scratch query engine that borrows ideas from database internals and applies them to the DataFrame interface that Python developers already know. Understanding Polars means understanding why it is fast, not just that it is fast, and knowing when its design choices actually matter for your work.

This article covers the architecture that makes Polars different, walks through real code that demonstrates each core concept, explains the lazy evaluation system that delivers the performance gains, and charts where the project is headed. Real code, real understanding.

The Architecture: Why Polars Is Different from Pandas

Pandas was created by Wes McKinney at AQR Capital Management in 2008. It was built on top of NumPy, inheriting NumPy's row-major memory layout and single-threaded execution model. For over a decade, it served the Python data science ecosystem remarkably well. But Pandas carries architectural decisions from an era when datasets fit in memory on a single core, and those decisions have consequences at scale.

Polars starts from different foundations entirely. Three architectural choices define its performance characteristics.

First, Polars is written in Rust with Python bindings. The core engine compiles to native machine code. There is no Python interpreter overhead in the hot path. When Polars processes a column of a million integers, that loop runs in compiled Rust, not interpreted Python. The Python API you interact with is a thin layer that constructs query plans and dispatches them to the Rust engine.

On the Talk Python to Me podcast (Episode 402, recorded January 2023), Ritchie Vink articulated a design philosophy centered on avoiding unnecessary work — the ideal outcome being the computation that never needs to happen at all. By pushing filters down to the data source, Polars avoids loading or processing rows that fail a filter condition. This principle runs through the entire architecture.

Second, Polars uses Apache Arrow Columnar Format as its memory model. Arrow stores data in columns rather than rows. When you compute the mean of a single column, a columnar layout means the CPU reads contiguous memory rather than skipping across rows. This is not a minor optimization. Modern CPUs are dramatically faster when they can predict memory access patterns, and columnar access is as predictable as it gets.

Arrow also enables zero-copy interoperability with other Arrow-based tools. You can pass data between Polars, DuckDB, and PyArrow without copying or converting. The data stays in the same memory layout, and each library reads it natively.

Third, Polars is multi-threaded by default. Because the engine is Rust, it is not constrained by Python's Global Interpreter Lock (GIL). When Polars runs a group-by aggregation, it partitions the work across all available CPU cores automatically. You do not configure thread pools, manage workers, or think about parallelism. It happens.

Getting Started: Core Operations

Install Polars and start working:

pip install polars

Creating a DataFrame looks familiar if you have used Pandas, but the behavior diverges quickly:

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "department": ["Engineering", "Marketing", "Engineering", "Marketing", "Engineering"],
    "salary": [95000, 72000, 105000, 68000, 112000],
    "years": [5, 3, 8, 2, 10],
})

print(df)
shape: (5, 4)
┌─────────┬─────────────┬────────┬───────┐
│ name    ┆ department  ┆ salary ┆ years │
│ ---     ┆ ---         ┆ ---    ┆ ---   │
│ str     ┆ str         ┆ i64    ┆ i64   │
╞═════════╪═════════════╪════════╪═══════╡
│ Alice   ┆ Engineering ┆ 95000  ┆ 5     │
│ Bob     ┆ Marketing   ┆ 72000  ┆ 3     │
│ Charlie ┆ Engineering ┆ 105000 ┆ 8     │
│ Diana   ┆ Marketing   ┆ 68000  ┆ 2     │
│ Eve     ┆ Engineering ┆ 112000 ┆ 10    │
└─────────┴─────────────┴────────┴───────┘

Notice the type annotations beneath each column name. Polars is strict about data types. Where Pandas might silently convert an integer column to float when a null appears, Polars preserves the declared type and represents nulls natively. Fewer surprises, fewer bugs.

Note

Polars DataFrames do not have an index. In Pandas, the index is a powerful but often confusing feature that leads to alignment issues, reset_index calls scattered through code, and subtle bugs when merging datasets. Polars removes the concept entirely. Rows are accessed by position or by filtering on column values, and the API is simpler for it.

Expressions: The Core Abstraction

The expression system is what separates Polars from a "faster Pandas." Expressions describe computations on columns, and they compose naturally:

result = df.select(
    pl.col("name"),
    pl.col("salary").mean().alias("avg_salary"),
    (pl.col("salary") / pl.col("years")).alias("salary_per_year"),
)

print(result)

The pl.col() function refers to a column by name. Expressions can be chained, combined with arithmetic, and aliased to create new column names. This is not string manipulation or method chaining on a DataFrame object. Expressions are declarative descriptions of what you want computed, and Polars decides how to compute it efficiently.

Filtering, selecting, and transforming all use the same expression language:

# Filter rows where salary exceeds 90000
high_earners = df.filter(pl.col("salary") > 90000)

# Add a computed column
with_bonus = df.with_columns(
    (pl.col("salary") * 0.1).alias("bonus")
)

# Group and aggregate
dept_stats = df.group_by("department").agg(
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.len().alias("headcount"),
)

print(dept_stats)
shape: (2, 4)
┌─────────────┬────────────┬────────────┬───────────┐
│ department  ┆ avg_salary ┆ max_salary ┆ headcount │
│ ---         ┆ ---        ┆ ---        ┆ ---       │
│ str         ┆ f64        ┆ i64        ┆ u32       │
╞═════════════╪════════════╪════════════╪═══════════╡
│ Engineering ┆ 104000.0   ┆ 112000     ┆ 3         │
│ Marketing   ┆ 70000.0    ┆ 72000      ┆ 2         │
└─────────────┴────────────┴────────────┴───────────┘

Pro Tip

Multiple expressions passed to agg() can execute in parallel. Polars knows the computations are independent and distributes them across cores. In Pandas, the aggregations inside a .groupby().agg() call run sequentially on a single core.

Lazy Evaluation: Where the Real Performance Lives

Every discussion of Polars performance eventually arrives at lazy evaluation, and for good reason. The lazy API is what transforms Polars from "a faster DataFrame library" into "a query engine that happens to have a DataFrame API."

In eager mode, each operation executes immediately:

# Eager: every line runs immediately
df = pl.read_csv("flights.csv")                         # reads entire file
df = df.filter(pl.col("origin") == "SEA")               # scans all rows
df = df.select(["origin", "destination", "delay"])       # keeps 3 columns

In lazy mode, operations build a query plan that executes only when you call .collect():

# Lazy: builds a plan, optimizes, then executes once
result = (
    pl.scan_csv("flights.csv")
    .filter(pl.col("origin") == "SEA")
    .select(["origin", "destination", "delay"])
    .collect()
)

The difference is not just deferred execution. Before running, Polars' query optimizer analyzes the entire plan and applies transformations.

Predicate pushdown moves filter operations as early as possible. In the example above, the filter origin == "SEA" gets pushed down to the CSV reader. Polars skips rows that do not match while reading the file, rather than loading everything into memory and then filtering. For a file with millions of rows where only a fraction match, this can be the difference between seconds and minutes.

Projection pushdown identifies which columns are actually used by the query and reads only those. The select call asks for three columns, so Polars tells the CSV reader to ignore every other column in the file. For wide datasets with dozens or hundreds of columns, this eliminates enormous amounts of wasted I/O.

Common subexpression elimination detects when multiple expressions compute the same intermediate result and computes it once rather than repeating the work.

You can inspect the optimized plan before executing:

lazy_plan = (
    pl.scan_csv("flights.csv")
    .filter(pl.col("origin") == "SEA")
    .select(["origin", "destination", "delay"])
)

print(lazy_plan.explain())

The output shows you exactly what the optimizer did, including which predicates were pushed down and which columns were projected. Polars can detect type mismatches and errors before execution begins, rather than failing halfway through a long computation.

Eager vs. Lazy: When to Use Each

The Polars documentation describes lazy evaluation as the preferred, highest-performance mode of operation, and it is the right default for almost all production use cases. The eager API still has legitimate uses, however.

Use eager mode when exploring data interactively in a notebook, when the dataset fits comfortably in memory and performance is not a concern, or when you need the result of one operation to decide what to do next (conditional logic that cannot be expressed as a single query plan).

Use lazy mode when reading from files (especially Parquet and CSV), when performing multiple transformations that can be optimized as a pipeline, when working with large datasets, or when building production data pipelines.

The transition between the two is trivial:

# Eager DataFrame to LazyFrame
lazy = df.lazy()

# LazyFrame back to eager DataFrame
eager = lazy.collect()

Reading Data: scan_ vs. read_

Polars provides paired functions for data ingestion. The read_ variants load data eagerly. The scan_ variants create lazy query plans:

# Eager: loads entire file into memory immediately
df = pl.read_csv("data.csv")
df = pl.read_parquet("data.parquet")
df = pl.read_json("data.json")

# Lazy: creates a scan plan, reads only what's needed
lf = pl.scan_csv("data.csv")
lf = pl.scan_parquet("data.parquet")
lf = pl.scan_ndjson("data.ndjson")

Parquet files deserve special mention. Parquet is a columnar file format that stores metadata about row groups and columns. When Polars scans a Parquet file lazily, it reads the metadata first, then uses predicate and projection pushdown to read only the row groups and columns the query actually needs. For a 10 GB Parquet file where your query touches two columns and 5% of the rows, Polars may read less than 100 MB from disk.

# This might touch only a tiny fraction of a huge Parquet file
result = (
    pl.scan_parquet("massive_dataset.parquet")
    .filter(pl.col("country") == "Netherlands")
    .select(["city", "population"])
    .collect()
)

The Streaming Engine: Beyond RAM Limits

One of the persistent criticisms of in-memory DataFrame libraries is that they choke when the dataset exceeds available RAM. Polars addressed this with its streaming engine, rebuilt from scratch for the 1.0 era based on research into Morsel-Driven Parallelism — a NUMA-aware query execution framework originally published by Leis et al. at ACM SIGMOD 2014. The Polars 1.0 announcement described this as a hybrid push/pull design that compiles operator pipelines down to Rust async state machines, combining cache locality from morsel-driven execution with flexible operator design.

The streaming engine processes data in chunks (morsels) that fit in memory, applying the same optimizations as the in-memory engine but without requiring the full dataset to be loaded at once. You invoke it by passing engine='streaming' to collect(), or by using the sink_ methods which write results directly to disk:

# Run the query using the streaming engine
result = (
    pl.scan_csv("huge_file.csv")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .collect(engine="streaming")
)

# Or sink directly to disk without collecting into memory at all
(
    pl.scan_csv("huge_file.csv")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .sink_parquet("output.parquet")
)

Polars published benchmarks in December 2025 showing that as dataset size increases, the streaming engine's performance advantage over the in-memory engine grows significantly. For workloads that exceed available RAM, the streaming engine is the difference between completion and an out-of-memory crash.

Streaming Limitations

The streaming engine is not universally faster. Operations that require global ordering — such as rolling window functions, certain window aggregations, and time-series operations — need more synchronization in streaming mode than in-memory. Polars automatically falls back to the in-memory engine for those operations when you invoke streaming, so you won't get incorrect results, but you also won't always get a speedup.

GPU Acceleration: A Third Execution Path

Beyond the in-memory engine and the streaming engine, Polars now has a third execution path: GPU acceleration via NVIDIA RAPIDS cuDF. As of September 2024, this is available in open beta. You enable it with a single argument change to collect():

# Install with GPU support
# pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com

result = (
    pl.scan_parquet("large_dataset.parquet")
    .filter(pl.col("country") == "US")
    .group_by("product_category")
    .agg(
        pl.col("revenue").sum(),
        pl.col("units").mean(),
    )
    .collect(engine="gpu")  # that's the only change
)

The integration is deep rather than skin-deep. Instead of reimplementing the Polars API in cuDF, NVIDIA engineers integrated at the level of the Intermediate Representation (IR) produced by the Polars optimizer. The same query semantics, the same Python type inference, and the same optimizations apply; only the execution layer differs. When an operation isn't supported on the GPU, the engine falls back to CPU transparently without throwing an error.

According to NVIDIA's technical blog announcing the open beta (September 2024), the GPU engine powered by RAPIDS cuDF offers up to 13x speedup versus Polars on CPU for compute-bound queries involving complex group-bys and joins, benchmarked on PDS-H queries at scale factor 80 using an NVIDIA H100. The speedups are most pronounced on aggregation-heavy workloads and string operations; simple scans or highly I/O-bound queries benefit less.

GPU execution requires an NVIDIA GPU with compute capability 7.0 or higher (Volta architecture or later), CUDA 12, and the separate polars[gpu] extras install. It is currently a single-GPU implementation, which limits it to datasets that fit in GPU VRAM. For workloads that hit this boundary, Polars supports Unified Virtual Memory (UVM), introduced in the RAPIDS 24.12 release in December 2024, which allows GPU processing to spill data to system RAM, extending the practical limit beyond physical GPU memory.

When GPU Mode Makes Sense

GPU acceleration pays off most when your query plan is dominated by grouped aggregations, joins, or string operations on tens to hundreds of millions of rows — the workload profile where CPU parallelism saturates but GPU parallelism scales further. If your pipeline is I/O-bound (reading slow storage), the GPU will sit idle waiting for data. Profile first: if collect() completes in under two seconds on CPU, the GPU overhead of data transfer may not be worth it.

Polars vs. Pandas: A Concrete Comparison

Rather than abstract claims about speed, here is the same analytical task expressed in both libraries. The task: read a CSV, filter rows, compute grouped aggregations, and sort the result.

Pandas:

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])
df = df[df["region"] == "West"]
result = (
    df.groupby("product")
    .agg(
        total_revenue=("revenue", "sum"),
        avg_quantity=("quantity", "mean"),
        num_orders=("order_id", "count"),
    )
    .sort_values("total_revenue", ascending=False)
)

Polars (lazy):

import polars as pl

result = (
    pl.scan_csv("sales.csv", try_parse_dates=True)
    .filter(pl.col("region") == "West")
    .group_by("product")
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("quantity").mean().alias("avg_quantity"),
        pl.len().alias("num_orders"),
    )
    .sort("total_revenue", descending=True)
    .collect()
)

The Polars version reads only the columns it needs, applies the filter during CSV parsing rather than after, parallelizes the three aggregation computations, and does all of this in compiled Rust code. The Pandas version reads every column, loads everything into memory, filters after loading, and runs each aggregation sequentially on a single core.

The Polars website's own PDS-H benchmark results (run on a c3-highmem-22 at scale factor 10 including I/O) show Polars achieving more than 30x performance gains versus Pandas. A 2024 study by Felix Nahrstedt et al., published in the Proceedings of the Evaluation and Assessment in Software Engineering (EASE) conference, compared energy efficiency and performance of Python data processing libraries. Their findings confirmed that Polars consumed approximately 8 times less energy than Pandas in synthetic data analysis tasks with large DataFrames. For TPC-H benchmarks, Polars was roughly 40% more energy-efficient than Pandas for large datasets.

Independent benchmarks and Polars' own PDS-H results consistently show 5 to 20 times the throughput of Pandas for typical DataFrame operations, with some aggregation-heavy workloads at larger scale factors achieving substantially higher gains still.

Migration Gotchas: What Nobody Warns You About

The surface API of Polars looks familiar enough that migration seems simple. It often isn't. Here are the conceptual gaps that trip up Pandas veterans most reliably.

Nulls are not NaN, and the distinction matters. Pandas uses np.nan for missing floats and None for object columns — an inconsistent system that leads to silent bugs. Polars uses a single first-class null concept across all types. When you migrate code, any logic that checks for pd.isna(), compares to np.nan, or handles missing data through NumPy semantics needs to be rewritten using pl.col("x").is_null() and .fill_null(). Getting this wrong produces correct-looking output with subtly wrong null handling.

There is no SettingWithCopyWarning — and that's not just a convenience. Pandas raises SettingWithCopyWarning when chained indexing makes it ambiguous whether you're modifying a copy or a view. Polars eliminates the concept entirely: DataFrames are immutable by default. Every transformation returns a new frame. Code like df[df["x"] > 5]["y"] = 10 simply doesn't exist in Polars. You use .with_columns(pl.when(pl.col("x") > 5).then(...).otherwise(pl.col("y")).alias("y")) instead. The mental model shift is more significant than it first appears — you stop thinking in mutations and start thinking in transformations.

Group-by result ordering is not guaranteed. In Pandas, groupby().agg() preserves the sorted order of the group keys by default. In Polars, group_by() does not. If your downstream logic depends on ordered group output, add an explicit .sort(). This is one of the most common sources of subtle test failures when porting production code.

UDFs (apply/map) are where you pay for leaving Polars. Pandas' df.apply() is slow but familiar. Polars has .map_elements(), but every call to it breaks out of the Rust engine and enters Python. A UDF applied to a 10-million-row column means 10 million Python function calls, eliminating most of the performance advantage. The correct first response is to express the transformation as a Polars expression. String operations via pl.col("x").str.*, datetime operations via pl.col("x").dt.*, and list operations via pl.col("x").list.* cover a large fraction of what people reach for apply to do. When the logic genuinely cannot be expressed as a native Polars expression — proprietary algorithms, specialized NLP operations, stateful calculations — expression plugins are the proper solution: compile custom logic as a Rust library and register it as a native Polars expression. This approach is described in detail below.

Migration Strategy

Rather than rewriting entire pipelines at once, consider a boundary approach: keep Pandas for the parts of your stack that depend on it (visualization libraries, scikit-learn inputs, existing tests), and use Polars for the data-heavy transformation layers where performance actually matters. The df.to_pandas() and pl.from_pandas() conversion functions make this hybrid approach practical. Convert at the boundaries, not throughout.

Advanced Patterns: Window Functions, Conditionals, and Joins

Polars' expression system supports patterns that would require awkward workarounds in Pandas.

Window functions compute aggregations within groups without collapsing the DataFrame:

df = df.with_columns(
    pl.col("salary").mean().over("department").alias("dept_avg"),
    pl.col("salary").rank().over("department").alias("dept_rank"),
)

The .over() method is Polars' equivalent of SQL window functions. It partitions the data by the specified column(s) and computes the expression within each partition, then broadcasts the result back to the original row count. No transform or merge gymnastics required.

Conditional expressions use when/then/otherwise:

df = df.with_columns(
    pl.when(pl.col("years") >= 5)
    .then(pl.lit("Senior"))
    .when(pl.col("years") >= 2)
    .then(pl.lit("Mid"))
    .otherwise(pl.lit("Junior"))
    .alias("level")
)

Joins follow SQL semantics with explicit join types:

departments = pl.DataFrame({
    "department": ["Engineering", "Marketing"],
    "budget": [500000, 200000],
})

joined = df.join(departments, on="department", how="left")

Polars supports inner, left, right, full, cross, semi, and anti joins. The join implementation uses hash-based algorithms and runs in parallel.

The SQL Interface: A Bridge for SQL-First Teams

A question that deserves more attention when evaluating Polars: what happens to the SQL-fluent analyst who has no interest in learning a new Python expression syntax? Polars has a real answer: built-in SQL support that translates queries into expressions and executes them through the same query engine.

Polars does not have a separate SQL engine. Instead, it translates SQL into its expression IR and runs it through the same optimizer and execution engine. The SQL layer is not a second-class workaround — it benefits from the same predicate pushdown, projection pushdown, and parallelism that the expression API delivers.

The primary interface is pl.SQLContext. You register DataFrames or LazyFrames as named tables, then execute SQL queries against them:

import polars as pl

orders = pl.scan_parquet("orders.parquet")
products = pl.scan_parquet("products.parquet")

ctx = pl.SQLContext(orders=orders, products=products)

result = ctx.execute("""
    SELECT
        p.category,
        SUM(o.revenue) AS total_revenue,
        COUNT(*) AS order_count
    FROM orders AS o
    JOIN products AS p ON o.product_id = p.id
    WHERE o.status = 'completed'
    GROUP BY p.category
    ORDER BY total_revenue DESC
""").collect()

For ad-hoc queries, pl.sql() is even more direct — it automatically discovers DataFrames in the local scope by variable name:

df = pl.scan_csv("employees.csv")

result = pl.sql("""
    SELECT department, AVG(salary) AS avg_salary
    FROM df
    GROUP BY department
""", eager=True)

You can also mix and match. Run a SQL query to express a join or filter, then continue chaining Polars expressions on the result. The output of ctx.execute() is a LazyFrame, so the full expression API remains available. This hybrid approach is particularly useful in teams transitioning from SQL-heavy workflows: keep the SQL for the logic that's already written and tested, and use Polars expressions for the transformation patterns that benefit from the richer functional API.

SQL Coverage

Polars supports the most commonly used SQL subset: SELECT, WHERE, GROUP BY, ORDER BY, LIMIT, all standard join types, UNION, WITH (CTEs), and EXPLAIN. It does not implement the full SQL specification. Where syntax diverges, Polars follows PostgreSQL conventions. As of version 1.x, the Polars documentation lists SQL support as an officially supported feature, though the expression API remains the canonical path for new features.

A question worth asking explicitly: does this make the common recommendation of DuckDB for SQL-heavy teams obsolete? Not entirely. DuckDB offers a more complete SQL surface area, broader dialect compatibility, and a stronger story for teams that want SQL as their only interface. But the gap has narrowed considerably. Teams that want to do 90% of their work in SQL and reach for Python expressions when they need them now have a first-class path in Polars itself, without adding another runtime dependency.

Expression Plugins: Extending Polars in Rust

The previous section on migration gotchas identified the real problem with .map_elements(): it re-enters Python on every row, destroying the performance characteristics that make Polars valuable. The standard advice is to replace custom logic with native Polars expressions. That advice is correct, but it has a ceiling. Some computations are genuinely not expressible as combinations of built-in Polars operations — proprietary scoring models, specialized signal processing, domain-specific string transformations that go beyond .str.* patterns. What do you do then?

The answer is expression plugins: a first-class mechanism for compiling custom logic in Rust and registering it as a native Polars expression. The plugin runs inside the Rust engine with no Python interpreter overhead, benefits from the same parallelism as built-in expressions, and participates in the query optimizer. For elementwise operations, Polars can determine at plan time that the function is parallelizable, and will distribute the work across cores automatically.

The toolchain requires Rust and maturin (a Rust/Python packaging bridge), but the boilerplate is minimal. The official Polars documentation provides a full tutorial using a Pig Latin converter as a teaching example. In practice, a real plugin might look like this:

# After building the Rust plugin with maturin:
import polars as pl
import my_text_plugin as tp

# The plugin registers a new expression namespace
result = df.with_columns(
    tp.compute_domain_score(pl.col("url")).alias("score"),
    tp.normalize_phone(pl.col("phone_raw")).alias("phone"),
)

A community ecosystem of published plugins has formed around this mechanism. The Polars documentation maintains a curated list that includes polars-xdt for extended datetime operations, polars-distance for pairwise distance functions, polars-ds for data analysis operations, and polars-st for spatial operations modeled on GeoPandas and Shapely. These are distributed as regular Python packages — pip install polars-xdt — and they work transparently with Polars Cloud's remote execution.

The plugin architecture also resolves a deeper design tension. Polars keeps its core library lean, deliberately leaving highly specialized domains out of scope. Rather than absorbing geospatial operations, genomics processing, or financial quant operations into the main library, it provides a stable, performant extension point for domain-specific packages. The result is an ecosystem that grows independently while the core remains focused and fast.

Decision Framework

When custom logic is needed: first, try to express it using native Polars expressions. If that fails, check whether a community plugin already exists for the domain. If not, write an expression plugin in Rust — the performance will be equivalent to a built-in expression. Use .map_elements() only for genuine prototyping or when the operation is so infrequent that performance is not a concern. Never use it in a tight loop on a large column in a production pipeline.

Lakehouse Integration: Delta Lake and Iceberg

A question that rarely appears in introductory Polars material but becomes unavoidable in production: what happens when the data is not a CSV on a local filesystem, but a Delta Lake or Iceberg table sitting in cloud object storage? This is the normal state of affairs for data engineering teams in 2026, and Polars has first-class answers for both formats.

Delta Lake integration works through the deltalake Python package (delta-rs, a Rust implementation of the Delta Lake protocol with no Spark dependency). You can scan a Delta table into a Polars LazyFrame and pass it through the full lazy pipeline:

import polars as pl

# Scan a Delta table — all the lazy optimizations apply
result = (
    pl.scan_delta("s3://my-bucket/events-table")
    .filter(pl.col("event_date") >= "2025-01-01")
    .group_by("user_id")
    .agg(pl.col("event_type").count().alias("event_count"))
    .collect(engine="streaming")
)

# Write back to Delta
result.write_delta("s3://my-bucket/event-counts", mode="overwrite")

The integration is more capable than it initially appears. Polars can read specific Delta table versions for time travel (version=N), query the transaction log to determine which Parquet files are relevant before reading any data, and stream output directly to a Delta sink without collecting into memory. For teams already running Delta-backed lakehouses on Databricks or Unity Catalog, this means the subset of workloads that fit on a single machine can be run with Polars at dramatically lower infrastructure cost than a Databricks job.

An independent benchmark by Daniel Beach at Data Engineering Central in November 2025 demonstrated Polars, DuckDB, and Daft each processing 650 GB of Delta Lake data on a single 32 GB EC2 instance in streaming mode — completing successfully where a naive in-memory approach would have crashed. The comparison against PySpark on a single Databricks node illustrated how capable single-node tools have become for workloads that would historically have required a full cluster.

Apache Iceberg support works through the pyiceberg package. Polars can read Iceberg tables via pl.scan_iceberg(), and the same predicate pushdown and projection pushdown mechanics apply. Iceberg's partition evolution and hidden partitioning features are handled at the catalog layer — Polars sees a clean table interface and issues optimized reads through it:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("analytics.page_views")

result = (
    pl.scan_iceberg(table)
    .filter(pl.col("region") == "EU")
    .select(["session_id", "page_url", "duration_ms"])
    .collect()
)

The practical implication is significant. Polars is no longer just a tool for processing flat files. It is a viable single-node query engine for production lakehouse workloads — not as a replacement for distributed Spark on petabyte-scale jobs, but as a compelling alternative for the substantial portion of lakehouse queries that fit within the throughput of a well-provisioned single machine. The cost and complexity difference can be dramatic: a single r6i.8xlarge running Polars versus a ten-node EMR cluster running Spark, for the same query on a 200 GB table, favors Polars on both axes.

Install for Lakehouse Formats

Delta Lake support requires pip install polars[deltalake]. Iceberg support requires pip install polars[iceberg], which pulls in the pyiceberg package. Cloud storage access (S3, GCS, Azure Data Lake) requires additional filesystem libraries, typically via pip install polars[fsspec] or the cloud-specific extras.

The Ecosystem and Where Polars Is Heading

Polars has evolved from a single-developer side project to a funded company with an expanding ecosystem. In September 2025, Polars raised an 18 million euro Series A led by Accel, with participation from Bain Capital Ventures. The Series A announcement noted growth from 250,000 to over 23 million monthly users since the earlier $4 million seed round in 2023.

Earlier that same month, the company launched Polars Cloud as Generally Available on AWS — a managed platform where you call .remote(ctx) on a LazyFrame to execute queries on provisioned cloud compute using the same API you use locally. Polars Distributed, the multi-node engine targeting petabyte-scale workloads, launched simultaneously as an open beta on Polars Cloud. With Polars Distributed, the company is positioning itself as a direct cost and complexity alternative to Apache Spark and managed Spark platforms like Databricks and AWS Glue. Independent reviewers have noted pricing around $0.05 per vCPU/hour, which the company has positioned directly against Glue's per-DPU pricing.

The integration surface beyond Polars Cloud has also expanded substantially. Polars ships native connectors for Delta Lake and Apache Iceberg, the two dominant open table formats in production data lakehouses. These are covered in detail in the Lakehouse Integration section above — the short version is that Polars can now function as a complete single-node query engine over cloud-hosted lakehouse tables, not just a tool for processing local files.

A growing validation ecosystem has formed around Polars DataFrames. Libraries including Pandera, Patito (which launched with native Polars support in late 2022), Dataframely, and Pointblank each approach data quality from different angles — from type-safe schema validation to stakeholder-facing quality reports. This ecosystem reflects how seriously teams are taking Polars in production pipelines, where data quality gates are non-negotiable.

The first O'Reilly book on the library, Python Polars: The Definitive Guide by Jeroen Janssens and Thijs Nieuwdorp, was published in April 2025, with a foreword by Ritchie Vink. On PyPI, the latest release is version 1.38.1 (February 2026); the library now requires Python 3.10 or higher and is MIT-licensed.

When Not to Use Polars

Polars is not the right tool for every situation, and being honest about its limits matters as much as understanding its strengths.

Small datasets (under ~100,000 rows): The performance difference over Pandas is negligible at this scale, and Pandas' larger ecosystem of tutorials, integrations, and community answers may make it the pragmatic choice. More importantly: if you're spending engineering time optimizing a pipeline that runs in 200ms, the performance delta isn't worth the migration cost.

Heavy reliance on the Pandas ecosystem: Historically, libraries like scikit-learn, statsmodels, and many visualization tools expected Pandas DataFrames as inputs. scikit-learn added Polars support in version 1.4, allowing transformers to accept Polars DataFrames as input and return Polars output via set_output(transform="polars"), so that specific friction point has diminished. But statsmodels, many plotting libraries, and a large share of internal tooling in mature organizations still assume Pandas objects. You can convert between Polars and Pandas with df.to_pandas() and pl.from_pandas(pandas_df), but if your entire pipeline is tightly coupled to Pandas objects throughout — not just at data ingestion — the friction is real. The deeper solution here is to use Polars for the heavy transformation work and convert only at the boundaries where necessary, rather than trying to go all-in or not at all.

Distributed computing (today): Polars Distributed is in open beta on AWS via Polars Cloud. For production petabyte-scale distributed workloads where you need battle-tested fault tolerance and a mature operational story, PySpark and Databricks remain the safer bets today. What's worth watching is whether Polars Distributed matures quickly enough to challenge that assumption in the next 12 to 18 months — the Decathlon case study from September 2025, which documented migrating Spark workloads to Polars to reduce infrastructure cost and complexity, suggests the trajectory is real.

SQL-heavy teams: This picture has changed in recent versions. Polars now ships a first-class SQL interface via pl.SQLContext and pl.sql() that translates SQL into expressions and runs them through the same engine. Teams that want to work primarily in SQL can do so without leaving Polars. That said, DuckDB remains the stronger choice if the requirement is near-complete SQL specification coverage, broad dialect compatibility, or a JDBC/ODBC interface for BI tools. DuckDB and Polars are also interoperable: you can query a Polars DataFrame directly from DuckDB with zero copying, because they share the Arrow memory model. A hybrid approach — DuckDB for complex analytical SQL, Polars for programmatic Python transformations — is a legitimate and increasingly common architecture. But teams that simply prefer SQL syntax over the expression API now have a credible path entirely within Polars.

Iterative row-by-row processing: If the core of your algorithm requires processing rows one at a time with state that depends on the previous row — certain simulation patterns, stateful event processing, or algorithms that haven't yet been vectorized — Polars won't help you much. The expression model assumes columnar, set-based computation, and forcing row iteration via .map_elements() negates the performance benefits. In these cases, Polars (or Pandas) still makes sense for the data loading and preprocessing stages, but the iterative kernel itself needs a different approach: Numba, a compiled C extension, or a custom Rust extension.

A Complete Practical Example

Here is a realistic data pipeline that demonstrates Polars' strengths in a single, cohesive script:

"""
Analyze web server logs: find the top endpoints by error rate,
compute hourly traffic patterns, and identify suspicious IP addresses.
"""
import polars as pl

# Lazy scan: reads only needed columns, pushes filters down
logs = pl.scan_csv(
    "server_logs.csv",
    try_parse_dates=True,
    schema_overrides={"status_code": pl.UInt16, "response_bytes": pl.UInt32},
)

# Pipeline 1: Error rates by endpoint
error_rates = (
    logs
    .group_by("endpoint")
    .agg(
        pl.len().alias("total_requests"),
        (pl.col("status_code") >= 400).sum().alias("error_count"),
    )
    .with_columns(
        (pl.col("error_count") / pl.col("total_requests") * 100)
        .round(2)
        .alias("error_rate_pct")
    )
    .filter(pl.col("total_requests") > 100)
    .sort("error_rate_pct", descending=True)
    .head(20)
)

# Pipeline 2: Hourly traffic
hourly_traffic = (
    logs
    .with_columns(pl.col("timestamp").dt.hour().alias("hour"))
    .group_by("hour")
    .agg(
        pl.len().alias("requests"),
        pl.col("response_bytes").sum().alias("total_bytes"),
        pl.col("response_time_ms").quantile(0.95).alias("p95_latency"),
    )
    .sort("hour")
)

# Pipeline 3: Suspicious IPs (high error rate + high volume)
suspicious_ips = (
    logs
    .group_by("client_ip")
    .agg(
        pl.len().alias("request_count"),
        (pl.col("status_code") >= 400).mean().alias("error_fraction"),
    )
    .filter(
        (pl.col("request_count") > 50)
        & (pl.col("error_fraction") > 0.5)
    )
    .sort("request_count", descending=True)
)

# Execute all three plans together; common subplan elimination shares the scan
results = pl.collect_all([error_rates, hourly_traffic, suspicious_ips])

top_errors, traffic, bad_ips = results

print("Top Error Endpoints:")
print(top_errors)
print("\nHourly Traffic:")
print(traffic)
print("\nSuspicious IPs:")
print(bad_ips)

The pl.collect_all() call is significant. When you pass multiple LazyFrames that share a common data source, the optimizer applies Common Subplan Elimination: it identifies the shared scan_csv operation and executes it once, caching the result for all three downstream pipelines. Three analytical queries, one file read, automatic parallelism, and query optimization, all in roughly 50 lines of Python.

The Trajectory

Writing on the Polars blog in September 2025, Ritchie Vink described how the project's scope had expanded from building a faster DataFrame library into building a state-of-the-art single-node query engine specialized for DataFrames — and further still into a one-DataFrame-API vision that scales from a laptop to a distributed cloud cluster.

Whether that vision fully materializes remains to be seen, but the trajectory from a COVID-lockdown side project to over 23 million monthly users in five years — with a funded company, GPU acceleration, and a live cloud platform — suggests the approach is resonating far beyond performance benchmarks.

There is a deeper question worth sitting with: what does Polars' success signal about where the Python data stack is heading? For most of Pandas' lifespan, the assumed path for workloads that outgrew a single machine was to move to Spark — a fundamentally different programming model, a different operations footprint, and a different skill set. Polars, DuckDB, and their contemporaries are arguing for a different answer: that the single machine has become powerful enough, and the engines smart enough, that distributed computing should be the exception rather than the default. Polars Distributed and Polars Cloud are bets that when you do eventually need distribution, you should be able to reach it from the same API and the same mental model, rather than rewriting your pipeline in PySpark.

That argument is not settled. But the fact that a single Polars process can now read a Delta Lake table from S3 in streaming mode, process hundreds of gigabytes on a modest machine, query it with SQL or Python expressions interchangeably, validate output with a data quality library, write results back to an Iceberg table — and do all of this faster and cheaper than a Spark cluster would — changes the set of questions you need to ask before reaching for distributed infrastructure.

For Python developers working with data today, the practical takeaway is concrete: if your DataFrames are large enough that you notice Pandas struggling, or if your pipeline reads from files and applies transformations that a query optimizer could improve, Polars is worth learning. The API is clean, the documentation is solid, and the performance gains are not theoretical. They show up the first time you replace read_csv with scan_csv and add .collect() at the end. And if your dataset outgrows RAM, collect(engine="streaming") is one argument away from not crashing.

Start lazy. Let the engine do the work.
