The pandas DataFrame is the workhorse of Python data analysis. It is a two-dimensional, labeled data structure capable of holding mixed types across columns, and it is the object you will spend more time with than any other when doing serious data work in Python. This guide covers how DataFrames actually work, how to manipulate them correctly, and what has changed with the release of pandas 3.0 — because some patterns you learned a year ago will silently break your code today.
pandas is not a built-in Python library. You install it separately, and it sits on top of NumPy, providing a higher-level interface designed for tabular, labeled data. The library was originally created by Wes McKinney at AQR Capital Management in 2008 and open-sourced in 2009. The name comes from "panel data," a term used in econometrics for multi-dimensional data sets involving measurements over time. Since then it has grown into one of the dominant tools for data analysis in Python, used by data scientists, analysts, and engineers across virtually every industry.
The current stable release as of this writing is pandas 2.3.3 (released September 29, 2025), and pandas 3.0.0 arrived on January 21, 2026, bringing the most significant behavioral change in years: Copy-on-Write is now the default. If you are upgrading from an older version, you need to understand that change. This guide addresses it directly in its own section.
What a DataFrame Is (and What It Is Not)
A DataFrame is best understood as a dictionary of Series objects that share the same index. Each column is a pd.Series, each Series has a dtype, and all of them are aligned on a common row index. The index does not have to be integers. It can be strings, dates, or any hashable Python type. That alignment is what makes vectorized operations across columns possible without explicit loops.
The pandas documentation describes the DataFrame as providing "automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations." This is one of the core properties that separates pandas from plain NumPy arrays — NumPy knows nothing about row labels.
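A two-line example makes alignment concrete. Adding Series with different indexes matches values by label, not by position; labels present on only one side produce NaN:

```python
import pandas as pd

# Two Series with overlapping but unequal indexes
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Addition aligns on labels: "a" and "d" have no counterpart -> NaN
total = s1 + s2
print(total)
# a     NaN
# b    21.0
# c    32.0
# d     NaN
```

Note that the result is the union of both indexes, and the dtype is promoted to float64 to hold the NaN values.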
A DataFrame is not a database table, even though it resembles one. It is an in-memory structure. It does not enforce uniqueness on row labels, it does not have transactions, and it is not thread-safe. The pandas documentation explicitly cautions: "Since pandas is not thread safe, see the gotchas when copying in a threading environment." Keeping this in mind will save you from building systems that seem to work but fail under concurrent load.
pandas DataFrames are stored column-oriented internally: each column (or block of columns sharing a dtype) is backed by its own contiguous array. Reading an entire column is fast; assembling an entire row requires touching multiple underlying arrays. Design your operations to work column-wise whenever possible.
Creating and Inspecting DataFrames
You can create a DataFrame from a dictionary, a list of dictionaries, a NumPy array, a CSV file, a database query, or a wide range of other sources. The most common patterns in practice are from a dictionary of lists and from pd.read_csv().
import pandas as pd
# From a dictionary of lists
df = pd.DataFrame({
"product": ["Widget A", "Widget B", "Widget C", "Widget D"],
"region": ["North", "South", "North", "West"],
"revenue": [12400, 9800, 15200, 7600],
"units": [124, 98, 152, 76]
})
# From a CSV file
df = pd.read_csv("sales_data.csv")
# From a list of dicts (each dict is one row)
rows = [
{"product": "Widget A", "revenue": 12400},
{"product": "Widget B", "revenue": 9800},
]
df = pd.DataFrame(rows)
Once you have a DataFrame, the first thing you should do is get a sense of its shape and content. A small cluster of methods covers nearly all initial inspection needs:
df.shape # (rows, columns) as a tuple
df.dtypes # dtype of each column
df.info() # dtypes + non-null count + memory usage
df.describe() # summary statistics for numeric columns
df.head(10) # first 10 rows (default is 5)
df.tail(5) # last 5 rows
df.columns # column labels as an Index object
df.index # row labels as an Index object
df.info() is underused. It shows you the dtype of each column and the number of non-null values per column in a single call. When you load a CSV and a column you expect to be numeric shows as object, info() is usually the fastest way to spot the problem. A column that contains even one non-numeric string will be inferred as object dtype by the CSV reader.
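When info() reveals a column that should be numeric but shows up as object, pd.to_numeric with errors="coerce" is a common repair: valid entries become numbers and invalid ones become NaN, which you can then count, inspect, or fill. A minimal sketch with made-up data:

```python
import pandas as pd

# The "N/A" string forces the whole column to a string dtype
df = pd.DataFrame({"revenue": ["12400", "9800", "N/A", "15200"]})

# Coerce: parseable strings become numbers, the rest become NaN
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

print(df["revenue"].dtype)           # float64
print(df["revenue"].isnull().sum())  # 1 -- the coerced "N/A"
```

From there, df[df["revenue"].isnull()] shows you exactly which rows carried the bad values.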
Understanding dtypes
Every column in a DataFrame has a dtype. The most common ones you will encounter are int64, float64, bool, object (which means "Python objects," typically strings), datetime64[ns], and category. Starting with pandas 3.0, string columns are inferred as a dedicated string dtype by default rather than object; in pandas 2.x you can preview this behavior by setting pd.options.future.infer_string to True.
The dtype matters for performance and for which operations are available. You can always check or change a dtype explicitly:
# Check a column's dtype
print(df["revenue"].dtype) # int64
# Cast a column
df["revenue"] = df["revenue"].astype(float)
# Convert a string column to categorical
# (dramatically reduces memory for low-cardinality columns)
df["region"] = df["region"].astype("category")
# Convert to datetime
df["sale_date"] = pd.to_datetime(df["sale_date"])
For columns that hold a small, fixed set of string values — like "region," "status," or "category" — converting to category dtype can reduce memory usage by 90% or more and significantly speed up groupby operations. Use df["col"].nunique() to check cardinality before deciding.
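The savings are easy to measure with memory_usage(deep=True). A sketch with synthetic data: 100,000 rows drawn from only four region names, stored once as plain strings and once as category:

```python
import pandas as pd
import numpy as np

# 100,000 rows drawn from a fixed set of four values: low cardinality
rng = np.random.default_rng(0)
regions = pd.Series(rng.choice(["North", "South", "East", "West"], size=100_000))

as_strings = regions.memory_usage(deep=True)
as_category = regions.astype("category").memory_usage(deep=True)
print(f"strings:  {as_strings:,} bytes")
print(f"category: {as_category:,} bytes")
```

The category version stores each unique string once plus a small integer code per row, which is where the large reduction comes from.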
Selecting and Indexing Data
Selecting data from a DataFrame is one of the areas where new users get confused the fastest, because pandas offers multiple syntaxes that look similar but behave differently in edge cases. The two you should rely on for almost everything are .loc[] and .iloc[].
Column selection
# Single column -> returns a Series
s = df["revenue"]
# Multiple columns -> returns a DataFrame
subset = df[["product", "revenue"]]
# Boolean column selection (filter rows)
high_revenue = df[df["revenue"] > 10000]
# Filter with multiple conditions
north_high = df[(df["region"] == "North") & (df["revenue"] > 10000)]
loc and iloc
.loc[] is label-based. You use the actual index labels and column names. .iloc[] is position-based. You use integer positions, just like Python list slicing. Both accept a row selector and an optional column selector separated by a comma.
# loc: label-based
df.loc[0, "revenue"] # single value at row label 0, column "revenue"
df.loc[0:2, "product":"revenue"] # rows 0-2 inclusive, columns product through revenue
df.loc[df["region"] == "North"] # boolean row selection
# iloc: position-based
df.iloc[0, 2] # row 0, column 2 (by integer position)
df.iloc[0:3, 1:4] # rows 0-2, columns 1-3 (exclusive upper bound, like Python slices)
df.iloc[-1] # last row
# Setting values with loc (the correct approach)
# Cast to float first: pandas 3.0 no longer silently upcasts an int
# column when you assign float values into it
df["revenue"] = df["revenue"].astype(float)
df.loc[df["region"] == "North", "revenue"] = df.loc[df["region"] == "North", "revenue"] * 1.05
Avoid df["col"][condition] = value. This is called chained assignment. In pandas 3.0 with Copy-on-Write as the default, it will silently do nothing rather than modify your DataFrame. Always use df.loc[condition, "col"] = value to assign in a single step. See the Copy-on-Write section for full details.
The query() method
For complex boolean filters, .query() can make the intent clearer than stacking conditions in brackets. It accepts a string expression evaluated against the column names:
# Equivalent to df[(df["region"] == "North") & (df["revenue"] > 10000)]
result = df.query("region == 'North' and revenue > 10000")
# Reference an external variable with @
threshold = 10000
result = df.query("revenue > @threshold")
Cleaning and Transforming Data
Real data is never clean. Handling missing values, fixing dtypes, renaming columns, and adding derived columns are the tasks that consume the largest share of a data analyst's time. pandas provides clean, composable tools for all of them.
Handling missing values
# Detect missing values
df.isnull() # DataFrame of booleans
df.isnull().sum() # count of NaN per column
# Drop rows with any NaN
df_clean = df.dropna()
# Drop rows only if all values are NaN
df_clean = df.dropna(how="all")
# Drop rows where specific columns are NaN
df_clean = df.dropna(subset=["revenue", "product"])
# Fill NaN with a fixed value
df["revenue"] = df["revenue"].fillna(0)
# Fill NaN with column mean
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())
# Forward-fill (useful for time series)
df["price"] = df["price"].ffill()
Adding and renaming columns
# Add a new column by arithmetic
df["revenue_per_unit"] = df["revenue"] / df["units"]
# Add multiple columns at once with assign()
# assign() returns a new DataFrame; it does not modify in place
df = df.assign(
revenue_per_unit=lambda x: x["revenue"] / x["units"],
is_high_value=lambda x: x["revenue"] > 10000
)
# Rename columns
df = df.rename(columns={"revenue": "total_revenue", "units": "units_sold"})
# Rename all columns at once
df.columns = ["product", "region", "total_revenue", "units_sold"]
# Drop columns
df = df.drop(columns=["units_sold"])
In pandas 3.0, a new expression syntax was introduced via pd.col(), which allows you to reference columns without using lambda functions inside assign(). The syntax is borrowed from PySpark and Polars:
# pandas 3.0+: pd.col() syntax inside assign()
df = df.assign(revenue_per_unit=pd.col("revenue") / pd.col("units"))
Applying functions
# Apply a function element-wise to a column
df["product_upper"] = df["product"].str.upper()
# Apply an arbitrary function to each row (axis=1) or column (axis=0)
df["combined"] = df.apply(lambda row: f"{row['product']} - {row['region']}", axis=1)
# map() for element-wise transformations on a Series
region_map = {"North": "N", "South": "S", "West": "W"}
df["region_code"] = df["region"].map(region_map)
# String methods via the .str accessor
df["product_lower"] = df["product"].str.lower()
df["starts_with_w"] = df["product"].str.startswith("Widget")
Avoid apply() with axis=1 for large DataFrames. It is a Python-level loop and is orders of magnitude slower than vectorized column operations. Most things you want to do row-wise can be expressed as column arithmetic or string accessor operations instead.
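The row-wise apply() example above is a good illustration: the same combined string can be built with vectorized column concatenation, which produces identical values without a Python-level loop:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Widget A", "Widget B"],
    "region": ["North", "South"],
})

# Row-wise apply: a Python loop over rows
slow = df.apply(lambda row: f"{row['product']} - {row['region']}", axis=1)

# Vectorized equivalent: string concatenation on whole columns
fast = df["product"] + " - " + df["region"]

print(fast.tolist())  # ['Widget A - North', 'Widget B - South']
assert slow.tolist() == fast.tolist()
```

On a few rows the difference is invisible; on millions of rows the vectorized form is typically orders of magnitude faster.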
Sorting
# Sort by one column
df_sorted = df.sort_values("revenue", ascending=False)
# Sort by multiple columns
df_sorted = df.sort_values(["region", "revenue"], ascending=[True, False])
# Sort by index
df_sorted = df.sort_index()
Grouping and Aggregating
The groupby() method implements the split-apply-combine pattern: it splits the DataFrame into groups based on one or more columns, applies a function to each group independently, and then combines the results back into a single output. The pandas documentation describes it plainly: "A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups."
"The abstract definition of grouping is to provide a mapping of labels to group names." — pandas User Guide, Group by: split-apply-combine
A groupby() call does not immediately compute anything. It creates a GroupBy object that is lazy — execution happens when you chain an aggregation, transformation, or filter.
# Basic groupby with a single aggregation
region_revenue = df.groupby("region")["revenue"].sum()
# Multiple aggregations at once using agg()
summary = df.groupby("region")["revenue"].agg(["sum", "mean", "count"])
# Aggregate different columns with different functions
summary = df.groupby("region").agg(
total_revenue=("revenue", "sum"),
avg_units=("units", "mean"),
product_count=("product", "count")
)
# Group by multiple columns
summary = df.groupby(["region", "product"])["revenue"].sum()
# Reset the index after groupby to get a flat DataFrame
summary = df.groupby("region")["revenue"].sum().reset_index()
transform() vs agg()
This distinction trips up a lot of people. agg() collapses each group to a summary statistic, producing a smaller output with one row per group. transform() applies a function to each group but returns an output with the same shape as the input, making it useful for adding a column to the original DataFrame that references group-level statistics.
# agg(): one row per group
group_totals = df.groupby("region")["revenue"].agg("sum")
# group_totals has 3 rows (one per region)
# transform(): same shape as original DataFrame
# Add each row's revenue as a fraction of its region's total
df["pct_of_region"] = (
df["revenue"] / df.groupby("region")["revenue"].transform("sum") * 100
)
# Standardize revenue within each region (z-score)
df["revenue_zscore"] = df.groupby("region")["revenue"].transform(
lambda x: (x - x.mean()) / x.std()
)
filter()
The filter() method keeps only the groups that pass a boolean test on the group as a whole. It returns rows from the original DataFrame, not aggregated values:
# Keep only rows where the group has more than one member
df_filtered = df.groupby("region").filter(lambda x: len(x) > 1)
# Keep only groups where total revenue exceeds 20000
df_filtered = df.groupby("region").filter(lambda x: x["revenue"].sum() > 20000)
As of pandas 3.0, the observed parameter for groupby() defaults to True, meaning only observed categories are included in the output when grouping by a column with category dtype. In older versions the default was False, which included all category levels even if no rows matched (pandas 2.x emitted a FutureWarning about the upcoming change). If your grouped output is missing rows you expect, check this parameter.
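Passing observed explicitly sidesteps the version-dependent default entirely. A small sketch with a categorical column that declares a level ("West") no row actually uses:

```python
import pandas as pd

df = pd.DataFrame({
    # "West" is a declared category with no matching rows
    "region": pd.Categorical(["North", "South"],
                             categories=["North", "South", "West"]),
    "revenue": [100, 200],
})

# observed=True: only categories present in the data appear (2 rows)
print(df.groupby("region", observed=True)["revenue"].sum())

# observed=False: every declared category appears; the empty
# "West" group sums to 0 (3 rows)
print(df.groupby("region", observed=False)["revenue"].sum())
```

Being explicit here also documents your intent for the next reader, regardless of which pandas version runs the code.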
Merging, Joining, and Concatenating
pandas provides three related but distinct operations for combining DataFrames: pd.merge() for SQL-style joins, DataFrame.join() for index-based joining, and pd.concat() for stacking DataFrames along an axis.
pd.merge()
pd.merge() is the most general-purpose tool. It joins two DataFrames based on shared column values, similar to a SQL JOIN. The how parameter controls the join type: "inner", "left", "right", or "outer".
products = pd.DataFrame({
"product_id": [1, 2, 3],
"product_name": ["Widget A", "Widget B", "Widget C"]
})
sales = pd.DataFrame({
"product_id": [1, 1, 2, 4],
"sale_amount": [500, 600, 400, 300]
})
# Inner join: only rows where product_id exists in both
inner = pd.merge(sales, products, on="product_id", how="inner")
# Left join: all rows from sales, NaN where no match in products
left = pd.merge(sales, products, on="product_id", how="left")
# Merge on columns with different names
# (renaming one side here just to illustrate left_on/right_on)
merged = pd.merge(sales, products.rename(columns={"product_id": "prod_id"}),
left_on="product_id", right_on="prod_id",
how="inner")
# Disambiguate overlapping column names with suffixes
merged = pd.merge(df1, df2, on="id", suffixes=("_left", "_right"))
pd.concat()
pd.concat() stacks DataFrames vertically (adding rows) or horizontally (adding columns). It is the right tool when you have multiple DataFrames with the same schema that you want to combine into one.
# Stack rows vertically (axis=0, the default)
df_all = pd.concat([df_q1, df_q2, df_q3, df_q4], ignore_index=True)
# Stack columns horizontally
df_wide = pd.concat([df_features, df_targets], axis=1)
# Add a key to identify which original DataFrame each row came from
df_all = pd.concat(
[df_q1, df_q2],
keys=["Q1", "Q2"]
)
Never use pd.concat() inside a loop, appending to a growing DataFrame one row at a time. Each call copies all existing data. Instead, collect your DataFrames in a list and call pd.concat(list_of_dfs) once at the end. This is one of the most common performance anti-patterns in pandas code.
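The recommended pattern looks like this; the loop body below just fabricates one small chunk per iteration as a stand-in for whatever loading or processing step produces your pieces:

```python
import pandas as pd

# Collect pieces in a plain Python list...
pieces = []
for quarter in ["Q1", "Q2", "Q3", "Q4"]:
    # stand-in for loading or computing one chunk per iteration
    chunk = pd.DataFrame({"quarter": [quarter], "revenue": [1000]})
    pieces.append(chunk)

# ...and concatenate exactly once at the end
df_all = pd.concat(pieces, ignore_index=True)
print(len(df_all))  # 4
```

Appending to a list is O(1) per iteration, so the total cost is one copy at the end instead of a copy of everything on every pass through the loop.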
Pivot Tables
A pivot table reshapes a DataFrame so that unique values in one column become the row labels, unique values in another column become the column headers, and a third column's values are aggregated at each intersection. The pandas implementation closely mirrors Excel pivot tables, which makes it intuitive for analysts with that background.
"pivot_table can be used to create spreadsheet-style pivot tables." — pandas User Guide, Reshaping and pivot tables
import pandas as pd
sales = pd.DataFrame({
"region": ["North", "North", "South", "South", "West", "West"],
"product": ["Widget A", "Widget B", "Widget A", "Widget B", "Widget A", "Widget B"],
"revenue": [12400, 9800, 8700, 11200, 7600, 6400]
})
# Basic pivot: total revenue by region and product
table = pd.pivot_table(
sales,
values="revenue",
index="region",
columns="product",
aggfunc="sum"
)
# Add row and column totals with margins=True
table = pd.pivot_table(
sales,
values="revenue",
index="region",
columns="product",
aggfunc="sum",
margins=True,
margins_name="Total"
)
# Fill NaN cells (no data for that combination) with 0
table = pd.pivot_table(
sales,
values="revenue",
index="region",
columns="product",
aggfunc="sum",
fill_value=0
)
# Multiple aggregation functions at once
table = pd.pivot_table(
sales,
values="revenue",
index="region",
columns="product",
aggfunc=["sum", "mean"]
)
pivot() vs pivot_table()
DataFrame.pivot() is the simpler sibling. It reshapes data without performing any aggregation. Because no aggregation is happening, it requires that the combination of index and column values be unique. If your data has duplicate row-column pairs, pivot() raises a ValueError. That is when you need pivot_table() instead, which handles duplicates by aggregating them.
# pivot() requires uniqueness; use when one row per index-column combination
pivoted = df.pivot(index="date", columns="variable", values="value")
# pivot_table() handles duplicates
table = pd.pivot_table(df, values="value", index="date",
columns="variable", aggfunc="mean")
melt(): the inverse operation
pd.melt() (or DataFrame.melt()) reverses a pivot. It takes a wide-format DataFrame and converts it to long format, which is often required by visualization libraries and statistical tools that expect one observation per row.
# wide -> long
df_long = pd.melt(
table.reset_index(),
id_vars=["region"],
value_vars=["Widget A", "Widget B"],
var_name="product",
value_name="revenue"
)
Copy-on-Write in pandas 3.0
This is the change that will affect more existing code than anything else in pandas 3.0. Copy-on-Write (CoW) is now the default and only mode. The official documentation states it plainly: "Copy-on-Write is now the default with pandas 3.0." If you are using pandas 2.x or earlier, you were living with an indexing model where some operations returned views of the original data and others returned copies, and it was often unclear which you had.
The practical consequence was bugs like this:
# pandas < 3.0 behavior (unpredictable)
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 100
# In some cases, df["foo"] was also modified because subset was a view.
# In other cases it was not. Behavior depended on the exact operation.
The pandas 3.0 release notes describe the goal of the fix: "There is now a clear rule: any subset or returned series/dataframe always behaves as a copy of the original, and thus never modifies the original." Under CoW, the ambiguity is gone. A derived object never modifies its parent, regardless of how it was created.
The implementation is smarter than it sounds. The documentation explains that "every new DataFrame or Series returned from an indexing operation or method behaves as a copy, but under the hood pandas will use views as much as possible, and only copy when needed to guarantee the 'behaves as a copy' behaviour." The actual copy is deferred until a write occurs. This means CoW often improves average performance and memory usage rather than hurting it, because unnecessary eager copies are eliminated.
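The "behaves as a copy" rule is easy to verify directly. The sketch below assumes pandas 3.0 behavior; the option line makes a pandas 2.x installation behave the same way for this example:

```python
import pandas as pd

# On pandas 2.x this opts in to Copy-on-Write to preview 3.0 behavior;
# on 3.0 itself the option is a deprecated no-op (CoW is always on)
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Any derived object behaves as a copy: writing to it never
# touches the parent DataFrame
subset = df["foo"]
subset.iloc[0] = 100

print(subset.tolist())     # [100, 2, 3]
print(df["foo"].tolist())  # [1, 2, 3] -- df is unchanged
```

Until the write to subset happens, no data is duplicated; the copy is made lazily at the moment of mutation, which is exactly the deferred behavior the documentation describes.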
What breaks under CoW
The pattern that stops working is chained assignment: using two back-to-back indexing steps to set a value.
# This NO LONGER modifies df in pandas 3.0:
df["foo"][df["bar"] > 5] = 100
# df["foo"] is now a copy; assigning to it does nothing to df.
# The correct replacement is a single loc step:
df.loc[df["bar"] > 5, "foo"] = 100
The old two-step pattern worked sometimes in older pandas because df["foo"] happened to return a view. Under CoW, every indexing step returns something that behaves as a copy, so the assignment goes to an object that is immediately discarded. The code does not hard-fail — pandas emits a ChainedAssignmentError warning, but the data is left untouched. The pandas docs note: "Because every single indexing step now behaves as a copy, this also means that 'chained assignment' (updating a DataFrame with multiple setitem steps) will stop working."
"The main goal of this change is to make the user API more consistent and predictable." — pandas 3.0.0 Release Notes
How to migrate
The migration path recommended by the pandas team is to upgrade to pandas 2.3 first, enable CoW explicitly with pd.options.mode.copy_on_write = True, and run your code to observe the warnings. Patterns that will break under pandas 3.0 emit ChainedAssignmentError warnings in that mode. Once those are resolved, upgrading to 3.0 should be straightforward.
# Enable CoW in pandas 2.x to preview 3.0 behavior
pd.options.mode.copy_on_write = True
# The pattern to adopt everywhere:
# Instead of: df["col"][mask] = value
# Use:
df.loc[mask, "col"] = value
# Instead of: df[df["x"] > 0]["y"] = 1
# Use:
df.loc[df["x"] > 0, "y"] = 1
In pandas 3.0, the option mode.copy_on_write no longer has any effect and is deprecated. Setting it will not switch back to old behavior. The only supported mode is CoW. The option will be removed entirely in pandas 4.0.
Key Takeaways
- Use .loc[] for all assignments. Chained assignment (df["col"][mask] = value) silently fails in pandas 3.0 with Copy-on-Write. The rule is simple: if you are writing a value back into a DataFrame, use a single df.loc[row_condition, "col"] = value expression.
- Know when to use agg() vs transform(). If you need a smaller summary table with one row per group, use agg(). If you need to add a column to your original DataFrame that references group-level statistics, use transform(). Mixing them up produces incorrect shapes.
- Prefer vectorized operations over apply(axis=1). Row-wise apply() is a Python loop. For large DataFrames it can be 100x or more slower than equivalent column arithmetic or string accessor operations. Reach for it only when no vectorized alternative exists.
- Convert low-cardinality string columns to category dtype. This applies to any column that holds a small, fixed set of values. Memory usage drops dramatically and groupby() operations run faster as a result.
- Build lists of DataFrames, then concatenate once. Accumulating rows with repeated pd.concat() inside a loop copies all existing data on every iteration. Collect the pieces in a Python list and call pd.concat(list_of_dfs, ignore_index=True) once at the end.
- Use pivot_table() over pivot() when your data can have duplicates. pivot() raises an error on duplicate index-column combinations. pivot_table() handles them by aggregating, and its parameters map closely to what you already know from Excel.
The pandas DataFrame is a remarkably capable structure, and its surface area is large enough that no single article can cover every method. What this guide has tried to give you is a precise mental model of how the core operations work, grounded in what the documentation and release notes actually say, including the behavioral changes that shipped with pandas 3.0. The patterns described here — label-based selection, groupby with transform, correct assignment via loc, CoW-safe code — will serve you across any version of pandas you are likely to encounter in production.
Sources
- pandas Development Team. What's new in 3.0.0 (January 21, 2026). pandas.pydata.org.
- pandas Development Team. Copy-on-Write (CoW) — pandas 3.0.1 documentation. pandas.pydata.org.
- pandas Development Team. Group by: split-apply-combine — pandas 3.0.1 documentation. pandas.pydata.org.
- pandas Development Team. Reshaping and pivot tables — pandas 3.0.1 documentation. pandas.pydata.org.
- pandas Development Team. pandas.DataFrame.groupby — pandas 3.0.1 documentation. pandas.pydata.org.
- pandas Development Team. pandas.pivot_table — pandas 3.0.1 documentation. pandas.pydata.org.
- pandas Development Team. What's new in 2.3.0 (June 4, 2025). pandas.pydata.org.
- Schmucker, Patrick. What's New in pandas 3.0: Expressions, Copy-on-Write, and Faster Strings. CodeCut, January 29, 2026.
- pandas Development Team. pandas — PyPI. pypi.org.