Data wrangling — also called data munging — is the process of taking messy, incomplete, or oddly structured raw data and transforming it into a format that is clean, consistent, and ready for analysis. If you have worked with real-world datasets for more than about fifteen minutes, you already know the problem.
Column names are inconsistent. Dates arrive in six different formats. Half the rows are missing values. Entire fields that should be numeric are secretly strings because someone typed “N/A” instead of leaving the cell blank.
This is not a niche concern. Industry estimates consistently suggest that data professionals spend somewhere between 50 and 80 percent of their total project time on wrangling tasks. The New York Times captured this reality in a now-famous 2014 headline calling this kind of work the “janitor work” that serves as the key hurdle to extracting insights. And Hadley Wickham, Chief Scientist at Posit (formerly RStudio) and the mind behind the influential tidy data framework, summed up the philosophical problem neatly in his 2014 paper published in the Journal of Statistical Software:
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham, Journal of Statistical Software, 2014
That observation, a deliberate riff on the opening line of Anna Karenina, is not just clever. It is a precise description of why data wrangling is so difficult to generalize. There is no single procedure that cleans all datasets, because the ways data can be broken are essentially infinite. What Python gives you, though, is a powerful and composable set of tools for handling the patterns that show up again and again.
This article walks through what data wrangling actually involves, why Python dominates this space, how to apply the core techniques using real code, and which Python Enhancement Proposals (PEPs) underpin the ecosystem that makes it all work. No copy-paste snippets you will never understand. Actual comprehension.
Why Python Became the Language of Data Wrangling
Python was not originally designed for data analysis. It was a general-purpose scripting language that happened to have an unusually readable syntax and a culture of building libraries to solve practical problems. The pivot toward data came when Wes McKinney, then a quant finance professional at AQR Capital Management, grew frustrated with the tools available for routine data manipulation tasks. In a 2024 interview on the Talk Python To Me podcast, McKinney described how he ended up building what became pandas:
“I was working in quant finance at the time. I started building a personal data analysis toolkit that turned into the pandas project.” — Wes McKinney, Talk Python To Me, Episode 462, May 2024
The core insight behind pandas was that Python needed something like R’s DataFrame: a structure for working with labeled, tabular data that made common operations — filtering rows, computing new columns, merging tables — simple and expressive. McKinney started building pandas in April 2008, AQR released the source code in 2009, and by the early 2010s, pandas had become the backbone of data work in Python.
In a 2018 interview with Quartz, McKinney explained what pandas actually does for people who are not programmers: “It enables people to analyze and work with data who are not expert computer scientists. You still have to write code, but it’s making the code intuitive and accessible.” In a more recent September 2025 interview for the Pretty Powerful pandas series, McKinney reflected on the library’s enduring role: “I see it as more of this ever expanding Swiss army knife of data... I think it’s always gonna be there as this Swiss army knife of small to medium data.”
The impact of this design choice — treating in-memory data like you would a SQL table, with labeled rows and columns — cannot be overstated. Before pandas, performing intuitive tabular analysis and exploration in Python required cobbling together NumPy arrays, dictionaries, and ad hoc indexing. The DataFrame abstracted all of that away and gave Python something that R had long enjoyed: a first-class structure for working with real-world, heterogeneous data.
The PEPs That Make Data Wrangling Possible
Python Enhancement Proposals are the formal mechanism through which changes to the Python language and standard library are proposed and documented. Several PEPs are directly relevant to the data wrangling ecosystem, even though many practitioners never interact with them explicitly. Understanding these helps you understand why the tools work the way they do.
PEP 3118 — Revising the Buffer Protocol
This PEP, authored by Travis Oliphant and Carl Banks, redesigned how Python objects share memory. The buffer protocol allows libraries like NumPy and pandas to exchange data without making expensive copies. When you create a pandas DataFrame backed by NumPy arrays, the underlying memory is shared efficiently because both libraries speak the same buffer protocol.
Without PEP 3118, the entire scientific Python stack would be dramatically slower, because every interaction between libraries would require copying data rather than sharing references to it.
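The buffer protocol is easiest to see with standard-library types. Here is a minimal sketch using `array` and `memoryview` standing in for NumPy and pandas — the same zero-copy sharing, without any third-party dependency:

```python
from array import array

# array.array exposes its memory through the PEP 3118 buffer protocol
a = array("q", [1, 2, 3, 4])   # contiguous buffer of signed 64-bit ints

# memoryview wraps that buffer without copying it
m = memoryview(a)
m[0] = 99                      # writes through to the underlying array

print(a[0])  # 99
```

NumPy arrays and pandas' underlying blocks participate in exactly this protocol, which is how `np.asarray(some_dataframe_column)` can avoid a copy.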
PEP 249 — Python Database API Specification v2.0
This PEP defines the standard interface that all Python database drivers must implement. When you use pandas.read_sql() to pull data from PostgreSQL, MySQL, SQLite, or any other relational database, pandas is relying on a database driver that conforms to PEP 249. The standardized interface means that pd.read_sql("SELECT * FROM users", connection) works essentially the same way regardless of which database you are connecting to. The PEP was originally written in 1999 by Marc-André Lemburg and has been refined over the decades, but its core contribution is the consistent connect() / cursor() / execute() / fetch() pattern that every Python database library follows.
PEP 484 — Type Hints
Introduced in Python 3.5, type hints allow you to annotate function signatures with expected types. Modern pandas (version 1.0 and later) uses type hints extensively in its public API, which means your IDE can now provide meaningful autocompletion and type checking when you are writing data wrangling code. Libraries like pandera build directly on top of type hints to provide runtime data validation for DataFrames.
If you have ever written a data pipeline that silently converted your integer column into floats because of a single null value, you understand why runtime type enforcement matters. Consider adding pandera schema validation to any pipeline that runs in production. And note that pandas 1.0+ introduced nullable data types — such as Int64 (capital I), boolean, and string — backed by pd.NA instead of np.nan, which solves the integer-to-float coercion problem entirely. Use df = pd.read_csv("data.csv", dtype_backend="numpy_nullable") to opt in.
PEP 574 — Pickle Protocol 5 with Out-of-Band Data
When you need to send large DataFrames between processes — for parallel processing with Dask or multiprocessing — Python uses pickle for serialization. Before PEP 574, this meant copying the entire dataset’s memory into the pickle stream. The PEP, authored by Antoine Pitrou and accepted for Python 3.8, introduced a mechanism for passing large buffers out-of-band — separately from the main pickle stream — enabling zero-copy data transfer when the application supports it. The PEP explicitly mentions both NumPy arrays and pandas DataFrames as motivating use cases.
The Core Operations of Data Wrangling
Data wrangling is not one thing. It is a collection of distinct operations that you apply in different combinations depending on what is wrong with your data. Here are the operations that matter, with code that shows you how they actually work rather than just what to type.
Loading and Inspecting Data
Before you can fix anything, you need to understand what you are working with. This sounds obvious, but skipping this step is probably the single most common mistake in data work.
import pandas as pd
# Load data from various sources
df = pd.read_csv("sales_data.csv")
# Get the shape: how many rows, how many columns
print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")
# Look at the first few rows -- not the summary statistics, the actual data
print(df.head(10))
# Check data types -- this is where problems hide
print(df.dtypes)
# How many nulls per column, as a percentage
null_pct = df.isnull().sum() / len(df) * 100
print(null_pct[null_pct > 0].sort_values(ascending=False))
The dtypes output is where you catch the silent killers. A column called revenue that shows as object instead of float64 means there are non-numeric values hiding in your data. A date column stored as a string means your time-series operations will fail. Catch these problems early.
Handling Missing Values
Missing data is not a single problem. It comes in several flavors, and the right approach depends on why the data is missing and what you intend to do with the dataset.
# Strategy 1: Drop rows where specific critical columns are null
# Use this when the missing data makes the row meaningless
df_clean = df.dropna(subset=["customer_id", "order_date"])
# Strategy 2: Fill with a sensible default
# Use this when absence of a value has a meaningful interpretation
df["discount"] = df["discount"].fillna(0)
# Strategy 3: Fill with a group-level statistic
# Use this when the value is likely related to some grouping variable
df["revenue"] = df.groupby("region")["revenue"].transform(
    lambda x: x.fillna(x.median())
)
# Strategy 4: Interpolate for time-series data
# Use this when values change smoothly over time
df = df.sort_values("date")
df["temperature"] = df["temperature"].interpolate(method="linear")
There is no universal “correct” strategy for missing data. Dropping rows loses information. Filling with the mean distorts the distribution. Interpolation assumes smooth continuity. The right choice depends on your data and your analysis, and the worst choice is to ignore the problem entirely.
Cleaning and Standardizing Text Data
String data is where the chaos really lives. People type things differently. Systems export data with trailing whitespace. Encoding problems turn perfectly good characters into garbage.
# Standardize text: strip whitespace, lowercase, fix encoding issues
df["product_name"] = (
    df["product_name"]
    .str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)  # collapse multiple spaces
)
# Standardize categorical values that mean the same thing
status_map = {
    "cancelled": "canceled",
    "can": "canceled",
    "complete": "completed",
    "comp": "completed",
}
# Use replace() rather than map() here: map() turns any value missing
# from the dict into NaN, while replace() leaves unmapped values untouched
df["order_status"] = df["order_status"].str.lower().str.strip().replace(status_map)
# Parse dates that arrive in inconsistent formats
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed", dayfirst=False)
That format="mixed" parameter in pd.to_datetime() was introduced in pandas 2.0. It tells pandas to try parsing each value individually rather than assuming a single format for the entire column — which handles the real-world case where one system exported dates as “2024-01-15” and another as “Jan 15, 2024” and both ended up in the same column.
Reshaping Data: Pivoting and Melting
The concept of “tidy data” — each variable forms a column, each observation forms a row, each type of observational unit forms a table — comes from Hadley Wickham’s 2014 paper in the Journal of Statistical Software. The two core reshaping operations are melting (going from wide to long) and pivoting (going from long to wide).
# MELTING: Wide format -> Long format
# Suppose you have quarterly revenue as separate columns
wide_df = pd.DataFrame({
    "company": ["Acme", "Globex"],
    "Q1_revenue": [100, 200],
    "Q2_revenue": [150, 250],
    "Q3_revenue": [120, 300],
    "Q4_revenue": [180, 280],
})
long_df = wide_df.melt(
    id_vars=["company"],
    value_vars=["Q1_revenue", "Q2_revenue", "Q3_revenue", "Q4_revenue"],
    var_name="quarter",
    value_name="revenue",
)
# Clean up the quarter column so it is just Q1, Q2, etc.
long_df["quarter"] = long_df["quarter"].str.replace("_revenue", "")
# PIVOTING: Long format -> Wide format
# Sometimes you need to go the other direction
pivot_df = long_df.pivot_table(
    index="company",
    columns="quarter",
    values="revenue",
    aggfunc="sum",
)
Use pivot_table rather than pivot when working with real data. pivot_table handles duplicate entries by aggregating them; pivot raises an error. In real data, duplicates are common.
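A quick demonstration of the difference, using a tiny illustrative frame:

```python
import pandas as pd

dup = pd.DataFrame({
    "company": ["Acme", "Acme", "Globex"],
    "quarter": ["Q1", "Q1", "Q1"],   # the Acme/Q1 pair appears twice
    "revenue": [100, 50, 200],
})

# pivot refuses duplicate index/column pairs outright
try:
    dup.pivot(index="company", columns="quarter", values="revenue")
except ValueError as e:
    print("pivot failed:", e)

# pivot_table aggregates the duplicates instead of raising
wide = dup.pivot_table(index="company", columns="quarter",
                       values="revenue", aggfunc="sum")
print(wide.loc["Acme", "Q1"])  # 150
```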
Merging and Joining Datasets
Real analysis almost always requires combining data from multiple sources. Pandas provides merge(), which implements the same join semantics you know from SQL.
# Bring together order data with customer information
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
# Inner join: only rows with matches in BOTH tables
merged = pd.merge(orders, customers, on="customer_id", how="inner")
# Left join: keep ALL orders, attach customer info where available
merged = pd.merge(orders, customers, on="customer_id", how="left")
# Check for unmatched rows -- this is quality control
unmatched_orders = merged[merged["customer_name"].isnull()]
print(f"Orders without customer match: {len(unmatched_orders)}")
That last step — checking for unmatched rows after a join — is the kind of thing that separates careful data work from sloppy data work. A left join that produces unexpected nulls is telling you that your key column has mismatches, which might mean data quality issues upstream.
Vectorized Operations and Why Loops Are a Mistake
One of the things that catches Python newcomers off guard is how slow row-by-row iteration is in pandas. This is not a minor performance difference. It can be the difference between a computation that takes one second and one that takes ten minutes.
# THE WRONG WAY: iterating row by row
# This works but is painfully slow on large datasets
for index, row in df.iterrows():
    df.at[index, "profit"] = row["revenue"] - row["cost"]
# THE RIGHT WAY: vectorized operation
# This does the same thing, orders of magnitude faster
df["profit"] = df["revenue"] - df["cost"]
# Conditional logic, vectorized
import numpy as np
df["margin_category"] = np.where(
    df["profit"] / df["revenue"] > 0.3,
    "high_margin",
    "low_margin",
)
The reason vectorized operations are faster is that they push the computation down into NumPy’s compiled C implementation, which operates on contiguous blocks of memory rather than individual Python objects. When you write a Python for-loop, you are paying the overhead of Python’s interpreter for every single row — type checking, attribute lookups, and dynamic dispatch on each iteration. When you write a vectorized operation, NumPy handles the entire array in a single C-level loop, and you pay Python’s overhead only once.
Watch out for chained indexing like df[df["region"] == "West"]["revenue"] = 0. This often operates on a copy of the data rather than the original DataFrame, leading to the notorious SettingWithCopyWarning. The fix is to use .loc for assignment: df.loc[df["region"] == "West", "revenue"] = 0. Starting in pandas 3.0, copy-on-write behavior is the default, which eliminates this class of bugs entirely.
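A runnable version of the safe pattern, on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"region": ["West", "East", "West"],
                   "revenue": [100, 200, 150]})

# .loc with a boolean mask and a column label writes to the original frame
df.loc[df["region"] == "West", "revenue"] = 0

print(df["revenue"].tolist())  # [0, 200, 0]
```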
GroupBy: Split-Apply-Combine
The groupby pattern is one of the most powerful concepts in data wrangling. You split your data into groups based on some criterion, apply an operation to each group independently, and combine the results back together.
# Basic aggregation: total revenue by region
regional_summary = (
    df.groupby("region")["revenue"]
    .agg(["sum", "mean", "count"])
    .rename(columns={"sum": "total_revenue", "mean": "avg_revenue", "count": "order_count"})
    .sort_values("total_revenue", ascending=False)
)
# Multiple aggregations on different columns
summary = df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_order_size=("revenue", "mean"),
    unique_customers=("customer_id", "nunique"),
    latest_order=("order_date", "max"),
)
# Transform: compute a value per group and broadcast it back
# This is useful for adding context without collapsing the DataFrame
df["region_avg_revenue"] = df.groupby("region")["revenue"].transform("mean")
df["pct_of_region_avg"] = df["revenue"] / df["region_avg_revenue"] * 100
The transform method is the one that many people overlook. It solves an extremely common problem: you want to compute something at the group level (like a regional average) and then compare each individual row against that group statistic, all without losing the row-level detail.
Method Chaining: Writing Readable Pipelines
One of the most practical patterns for writing maintainable wrangling code is method chaining. Rather than creating a dozen intermediate variables, you chain operations together so the entire transformation reads as a pipeline.
result = (
    pd.read_csv("raw_sales.csv")
    .rename(columns=str.lower)
    .assign(order_date=lambda d: pd.to_datetime(d["order_date"], format="mixed"))
    .query("order_status == 'completed'")
    .assign(profit=lambda d: d["revenue"] - d["cost"])
    .groupby("region", as_index=False)
    .agg(
        total_profit=("profit", "sum"),
        order_count=("order_id", "count"),
    )
    .sort_values("total_profit", ascending=False)
)
This is a complete pipeline: load, rename, parse dates, filter, compute a derived column, aggregate, and sort. Every step is visible. If something goes wrong, you can break the chain at any point and inspect the intermediate DataFrame.
For operations that do not natively return a DataFrame — such as custom validation functions or logging steps — pandas provides .pipe(), which lets you slot any function into a method chain. For example, .pipe(log_shape) could print the DataFrame’s dimensions mid-pipeline without breaking the flow. This small tool makes the difference between a chain that only works for built-in operations and one that accommodates your entire workflow.
Beyond pandas: The Modern Wrangling Ecosystem
Pandas is not the only tool in the box. As datasets have grown larger and workloads have become more demanding, the Python ecosystem has expanded significantly.
Polars is a DataFrame library written in Rust that provides significantly faster performance than pandas for many operations, particularly on large datasets. It offers a lazy evaluation API that builds a query plan before executing, allowing it to optimize the entire pipeline — reordering operations, eliminating unnecessary work, and parallelizing across CPU cores — rather than executing each step individually.
DuckDB brings an embedded analytical SQL engine into Python. It can query pandas DataFrames, Parquet files, and CSV files directly without requiring a separate server, and for SQL-shaped wrangling tasks it is often dramatically faster than doing the same work in pandas.
Dask extends the pandas API to work on datasets that do not fit in memory by partitioning them across multiple blocks and parallelizing operations.
Notably, pandas itself has evolved to keep pace. Pandas 2.0 (released April 2023) introduced optional Apache Arrow-backed data types, enabled per-read with the dtype_backend="pyarrow" keyword on readers such as read_csv and read_parquet (the experimental global option for this was removed before the 2.0 final release). Arrow-backed columns offer significantly better memory efficiency for string data, native nullable types that avoid the integer-to-float coercion problem, and faster operations for many common tasks. This is not a replacement for Polars or DuckDB, but it means that many of the performance wins those tools offer are now accessible without leaving the pandas API.
McKinney himself acknowledged this evolving landscape in his May 2024 Talk Python interview, noting that while alternatives have emerged, pandas retains its central role: “Pandas is the on-ramp for all of the data into your environment in Python. When people are building some application that touches data in Python, pandas is often the initial on-ramp for how data gets into Python, where you clean it up, regularize it, get it ready for analysis.”
Putting It All Together
Data wrangling is not glamorous work. It does not produce the visualizations that end up in presentations or the models that get deployed to production. But it is the foundation that everything else is built on. A model trained on poorly wrangled data will produce poorly calibrated predictions. A dashboard built on unclean data will generate distrust rather than insight.
Python, through pandas and the broader ecosystem, gives you a set of tools that are expressive enough to handle the full range of data quality problems you will encounter in practice. The key is not to memorize syntax — it is to understand the operations conceptually: inspecting, cleaning, reshaping, merging, aggregating, and transforming. Once you understand what you are trying to do, the pandas documentation will show you how to do it.
Key Takeaways
- Inspection first, always: Use dtypes, isnull().sum(), and head() before touching anything. Problems you catch early are problems you fix once.
- Missing data requires a decision, not a default: Dropping, filling, interpolating, and imputing are all valid — but each has tradeoffs. Know which one you are making and why.
- Vectorize everything: Row-by-row loops in pandas are orders of magnitude slower than vectorized operations. If you find yourself writing for index, row in df.iterrows(), stop and think about the vectorized equivalent.
- Use .loc for assignment: Chained indexing leads to subtle bugs. Always use df.loc[condition, column] for setting values, and consider enabling copy-on-write behavior for safer data manipulation.
- Method chaining produces readable, auditable pipelines: Chaining operations into a single expression makes your transformations easy to follow, debug, and maintain. Use .pipe() to integrate custom functions into your chains.
- The ecosystem has grown beyond pandas: For large datasets, Polars, DuckDB, and Dask are worth knowing. Pandas 2.0's Arrow-backed types bring many performance wins without leaving the pandas API. But pandas remains the on-ramp and the shared vocabulary of Python data work.
The Zen of Python (PEP 20) says there should be one — and preferably only one — obvious way to do things. McKinney himself referenced this principle when explaining why having pandas as a shared standard matters: “I think one reason pandas has gotten so popular is that it’s beneficial to the community to have fewer solutions. It’s the Zen of Python — there should be one and preferably only one obvious way to do things.” That is what a good data wrangling workflow looks like. Not a bag of tricks, but a principled approach to transforming raw data into something trustworthy.