Python Data Preprocessing: The Complete Hands-On Guide

Raw data is messy. It arrives riddled with missing values, inconsistent formats, wildly different scales, and hidden outliers that can quietly sabotage a model. Feeding it directly into a machine learning algorithm is a recipe for garbage predictions, or worse, confidently wrong ones. Data preprocessing is the disciplined first step that transforms chaotic raw data into something clean, structured, and ready for real analysis. This guide walks through every core preprocessing technique in Python, including the conceptual reasoning behind each decision, with practical code you can use immediately.

Whether you are preparing data for a classification model, a regression task, or exploratory analysis, the techniques covered here form the foundation of every data science workflow. The examples use pandas, NumPy, and scikit-learn (version 1.3 or later, which the TargetEncoder example requires), the standard libraries for preprocessing in Python.

Why Preprocessing Matters

Machine learning algorithms expect numerical input in a consistent, well-structured format. Real-world datasets almost never arrive that way. They contain blank cells, text labels where numbers are expected, features measured on completely different scales, duplicate records that skew distributions, and outliers that can silently pull trained weights in the wrong direction.

Skipping preprocessing leads directly to poor model performance. A decision tree trained on unclean data will split on noise instead of signal. A neural network fed unscaled features will struggle to converge. A logistic regression model given raw categorical strings will simply crash. Even when a model does run, undetected data quality problems mean the evaluation metrics you see during development will not hold up in production.

In its 2017 annual survey, CrowdFlower (now Appen) found that 51% of data science respondents named collecting, labeling, cleaning, and organizing data as the activity consuming the largest share of their working time. A Forbes analysis drawing on that same research described data preparation as roughly 80% of the overall data science role when data collection is included. The exact ratio varies by team and project, but the pattern is consistent: preprocessing is where the work happens, and it is where project outcomes are determined.

"Garbage in, garbage out" is a phrase often attributed to early IBM programmer George Fuechsel, and it remains the most concise summary of why preprocessing matters. No algorithm can recover signal that was destroyed before it ever saw the data.
Note

The examples in this guide use a small synthetic dataset so you can follow along without downloading anything. Every technique shown here scales directly to real-world datasets with millions of rows.

Loading and Inspecting Your Data

Before transforming anything, you need to understand what you are working with. Loading the data into a pandas DataFrame and running a few inspection commands reveals the shape of the dataset, the data types of each column, and where problems might be hiding. This step is the difference between targeted, efficient preprocessing and blind trial and error.

import pandas as pd
import numpy as np

# Create a sample dataset
data = {
    "age": [25, 30, np.nan, 45, 35, 28, 40, np.nan, 50, 33],
    "salary": [50000, 60000, 55000, np.nan, 70000, 48000, 85000, 62000, np.nan, 58000],
    "city": ["New York", "London", "Paris", "London", "New York",
             "Paris", "London", "New York", "Paris", "London"],
    "department": ["Sales", "Engineering", "Sales", "Engineering",
                   "HR", "Sales", "HR", "Engineering", "Sales", "HR"],
    "purchased": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
print(df.head())

With the DataFrame loaded, three commands give you a complete picture of the data:

# Shape: rows and columns
print(f"Shape: {df.shape}")

# Data types and non-null counts
# (info() prints its report directly and returns None, so no print() wrapper)
df.info()

# Statistical summary for numeric columns
print(df.describe())

The info() method is particularly revealing. It shows the count of non-null entries for each column, making it easy to spot where missing values exist. The describe() method gives you the mean, standard deviation, min, max, and quartiles for every numeric column, which helps identify potential outliers and unusual distributions.

Two additional inspection steps that many tutorials skip are checking the number of unique values per column and examining the actual distribution of values. These reveal subtle problems like a column that looks numeric but actually contains hidden string entries, or a categorical column where minor spelling variations create phantom categories.

# Unique values per column
print(df.nunique())

# Value distribution for categorical columns
print(df["city"].value_counts())

# Check for unexpected whitespace or casing issues
print(df["city"].str.strip().str.lower().value_counts())

Data Type Conversion

One of the least discussed but most impactful preprocessing steps is verifying and correcting data types. When data is loaded from CSVs, APIs, or databases, pandas infers types based on the content it encounters. It often guesses wrong. A column containing numeric IDs may load as integers when it should be treated as a categorical string. A date column may arrive as a plain string. A boolean column may contain the strings "True" and "False" instead of actual boolean values.

Incorrect types cause three categories of problems. First, they lead to inappropriate mathematical operations, like computing the mean of zip codes. Second, they waste memory, since storing a column with only two unique values as a string is far less efficient than storing it as a pandas category type. Third, they cause encoding functions to silently produce wrong results.

# Convert a column that should be categorical
df["department"] = df["department"].astype("category")

# Convert a string date column to datetime
# df["signup_date"] = pd.to_datetime(df["signup_date"])

# Force a column to numeric, coercing errors to NaN
# df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Check memory savings from category conversion
print(df["department"].memory_usage(deep=True))
Pro Tip

On large datasets, converting string columns with low cardinality to pandas category dtype can reduce memory usage by 90% or more. Run df.memory_usage(deep=True) before and after conversion to see the impact.
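As a sketch of that before-and-after check (the exact numbers vary with your pandas build and string contents, but the pattern holds):

```python
import pandas as pd

# Low-cardinality string column repeated many times
s = pd.Series(["Engineering", "Sales", "HR"] * 100_000)

before = s.memory_usage(deep=True)
after = s.astype("category").memory_usage(deep=True)

print(f"object dtype:   {before:,} bytes")
print(f"category dtype: {after:,} bytes")
print(f"reduction:      {1 - after / before:.0%}")
```

The category version stores each value as a small integer code plus one shared copy of each unique string, which is where the savings come from.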

Handling Missing Values

Missing data is one of the first problems you will encounter, and also one of the most consequential. The strategy you choose for handling it should depend not only on how much data is missing, but on why it is missing. This distinction is critical and is where many preprocessing tutorials fall short.

Understanding Missing Data Mechanisms

In 1976, statistician Donald Rubin introduced a classification framework for missing data that remains the foundation of modern practice. His three categories, later expanded upon in Statistical Analysis with Missing Data by Roderick Little and Donald Rubin (Wiley, 3rd edition, 2019), describe fundamentally different situations that demand different responses.

Missing Completely at Random (MCAR) means the probability that a value is missing has no relationship to any variable in the dataset, observed or unobserved. An example would be lab samples lost due to a random equipment failure. When data is MCAR, dropping incomplete rows does not introduce bias, though it does reduce statistical power.

Missing at Random (MAR) means the missingness depends on other observed variables, but not on the missing value itself. For example, if younger survey respondents are less likely to report their income, the income data is MAR because missingness is predicted by the observed age column. Imputation methods that leverage relationships between variables (like KNNImputer or IterativeImputer) work well under MAR because they use those observed predictors to estimate the missing values.

Missing Not at Random (MNAR) means the probability of missingness depends on the unobserved value itself. If people with very high incomes deliberately skip the income question on a survey, the data is MNAR. This is the hardest scenario to handle because the information needed to correct the bias is the information that is missing. MNAR typically requires domain-specific modeling or sensitivity analysis to assess its impact.
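A small simulation can make the MAR case concrete. In this toy example (the `df_sim` frame is illustrative, not the guide's dataset), income values go missing far more often for younger people, so the missingness is predictable from the observed age column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

age = rng.integers(18, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)
df_sim = pd.DataFrame({"age": age, "income": income})

# MAR: under-30s skip the income question 60% of the time,
# everyone else only 10% of the time
p_missing = np.where(df_sim["age"] < 30, 0.6, 0.1)
df_sim.loc[rng.random(n) < p_missing, "income"] = np.nan

# The missingness rate is predictable from the observed age column
young = df_sim["age"] < 30
print(f"missing rate, under 30: {df_sim.loc[young, 'income'].isna().mean():.2f}")
print(f"missing rate, 30+:      {df_sim.loc[~young, 'income'].isna().mean():.2f}")
```

Under MNAR, by contrast, the skip probability would depend on the income value itself, and no observed column would reveal the pattern.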

Note

You cannot determine the missing data mechanism by looking at the data alone. You need domain knowledge about how the data was collected and why values might be absent. As Stef van Buuren emphasizes in Flexible Imputation of Missing Data (CRC Press, 2018), a causal diagram of your data-generating process is the best tool for reasoning about which mechanism applies.

Detecting Missing Values

# Count missing values per column
print(df.isnull().sum())

# Percentage of missing values
print((df.isnull().sum() / len(df)) * 100)

# Visualize the pattern of missingness
# (helps distinguish MCAR from MAR)
print(df[df["age"].isnull()])

Strategy 1: Dropping Rows or Columns

If a small percentage of rows have missing values and you have reason to believe the data is MCAR, dropping them is the simplest approach. If an entire column is mostly empty, dropping the column makes more sense than trying to fill it. However, deletion under MAR or MNAR conditions can introduce bias that is invisible in your evaluation metrics but damaging in production.

# Drop rows with any missing value
df_dropped = df.dropna()

# Drop rows only if a specific column is missing
df_dropped_age = df.dropna(subset=["age"])

# Drop columns where more than 50% of values are missing
# (thresh is the minimum number of non-null values a column needs to be kept,
# and it must be an integer)
threshold = int(len(df) * 0.5)
df_dropped_cols = df.dropna(axis=1, thresh=threshold)
Warning

Dropping rows reduces your dataset size. On small datasets, this can hurt model performance more than the missing values would have. More importantly, if the data is not MCAR, deletion can introduce systematic bias that silently degrades your model. Prefer imputation when you cannot afford to lose data or when you suspect the missingness is not purely random.

Strategy 2: Simple Imputation

Imputation fills missing values with a calculated replacement. The simplest approaches use the column mean, median, or mode. scikit-learn's SimpleImputer handles this cleanly and integrates well with pipelines.

from sklearn.impute import SimpleImputer

# Mean imputation for numeric columns
imputer_mean = SimpleImputer(strategy="mean")
df[["age", "salary"]] = imputer_mean.fit_transform(df[["age", "salary"]])

print(df[["age", "salary"]].isnull().sum())  # Both should be 0

The available strategies for SimpleImputer are "mean", "median", "most_frequent", and "constant". Use median when your data has outliers, since the mean gets pulled toward extreme values. Use "most_frequent" for categorical columns.

# Median imputation (robust to outliers)
imputer_median = SimpleImputer(strategy="median")

# Constant imputation (fill with a specific value)
imputer_constant = SimpleImputer(strategy="constant", fill_value=0)

# Mode imputation for categorical data
imputer_mode = SimpleImputer(strategy="most_frequent")

Strategy 3: Advanced Imputation

Simple imputation has a fundamental limitation: it ignores relationships between features. If age and salary are correlated, filling missing ages with the global median discards the information that salary could provide. More advanced imputers solve this by using the structure of your data to make better guesses.

from sklearn.impute import KNNImputer

# KNN imputation: fill based on similar rows
knn_imputer = KNNImputer(n_neighbors=3)
df_knn = pd.DataFrame(
    knn_imputer.fit_transform(df[["age", "salary"]]),
    columns=["age", "salary"]
)

scikit-learn also provides IterativeImputer (experimental), which models each feature with missing values as a function of all other features, iterating until values converge. This approach is inspired by the MICE (Multiple Imputation by Chained Equations) algorithm, one of the standard tools in statistical practice for handling MAR data.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Iterative imputation: models each feature from all others
iter_imputer = IterativeImputer(max_iter=10, random_state=42)
df_iter = pd.DataFrame(
    iter_imputer.fit_transform(df[["age", "salary"]]),
    columns=["age", "salary"]
)
Pro Tip

A powerful technique is to add a binary "missingness indicator" column before imputing. This preserves the information that a value was originally missing, which can itself be predictive. For example, the fact that a customer did not provide their income may correlate with purchasing behavior, regardless of what the imputed income value is.

Removing Duplicates

Duplicate records inflate your dataset and bias your model toward repeated patterns. They also create a subtle problem during train-test splitting: if the same record appears in both the training set and the test set, your evaluation metrics become unrealistically optimistic because the model is effectively being tested on data it has already seen.

# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Remove exact duplicates
df = df.drop_duplicates()

# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=["age", "city"], keep="first")

The keep parameter controls which duplicate to retain. Set it to "first" to keep the first occurrence, "last" to keep the last, or False to drop all duplicates entirely.
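A quick illustration of the keep options on a toy frame (df_dup here is illustrative):

```python
import pandas as pd

# One repeated (age, city) pair
df_dup = pd.DataFrame({
    "age": [25, 25, 30],
    "city": ["London", "London", "Paris"],
})

print(len(df_dup.drop_duplicates(keep="first")))  # keeps the first London row
print(len(df_dup.drop_duplicates(keep=False)))    # drops both London rows
```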

For fuzzy or near-duplicate detection, where records refer to the same entity but differ slightly due to typos or formatting inconsistencies, exact matching is not enough. Techniques like Levenshtein distance on string fields, or the recordlinkage library in Python, can identify these soft duplicates that simple drop_duplicates() will miss.
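The standard library's difflib gives a quick approximation of this idea, using a ratio-based similarity rather than true Levenshtein distance (dedicated libraries like recordlinkage scale the same concept to large datasets; the df_people frame and 0.85 threshold below are illustrative):

```python
from difflib import SequenceMatcher

import pandas as pd

df_people = pd.DataFrame({
    "name": ["Jon Smith", "John Smith", "Alice Liu", "Jon  Smith "],
})

def similarity(a: str, b: str) -> float:
    # Normalize whitespace and case before comparing
    a, b = " ".join(a.lower().split()), " ".join(b.lower().split())
    return SequenceMatcher(None, a, b).ratio()

# Pairwise comparison; flag pairs above a similarity threshold
names = df_people["name"].tolist()
pairs = [
    (i, j, similarity(names[i], names[j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
]
soft_dupes = [(i, j, round(s, 2)) for i, j, s in pairs if s > 0.85]
print(soft_dupes)
```

Pairwise comparison is O(n²), so real deduplication systems first block records into candidate groups (by zip code, first letter, and so on) before comparing within each group.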

Detecting and Handling Outliers

Outliers are data points that deviate significantly from the rest of the distribution. They can originate from measurement errors, data entry mistakes, or genuine rare events. The critical question is not just "is this value unusual?" but "will this value harm my model, and should I keep it?"

Blindly removing all outliers is a common mistake. In fraud detection, the outliers are the signal. In medical diagnostics, rare values may represent the patients who need attention the most. The right approach depends entirely on your domain and your model.

IQR Method

The Interquartile Range (IQR) method is the standard statistical technique for identifying outliers. It calculates the range between the 25th percentile (Q1) and the 75th percentile (Q3), then flags any point that falls more than 1.5 times the IQR below Q1 or above Q3. This approach works well on skewed distributions because it relies on the median and quartiles rather than the mean and standard deviation.

# IQR-based outlier detection
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Flag outliers
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]
print(f"Outliers detected: {len(outliers)}")

# Remove outliers
df_clean = df[(df["salary"] >= lower_bound) & (df["salary"] <= upper_bound)]

Z-Score Method

For normally distributed data, the Z-score method flags values that fall more than a certain number of standard deviations from the mean. A threshold of 3 is conventional, meaning any value more than three standard deviations from the mean is considered an outlier.

from scipy import stats

z_scores = np.abs(stats.zscore(df["salary"].dropna()))
outlier_mask = z_scores > 3
print(f"Z-score outliers: {outlier_mask.sum()}")

Capping Instead of Removing

When you cannot afford to lose data points, capping (also called winsorization) replaces extreme values with the nearest acceptable boundary value instead of deleting the entire row. This preserves dataset size while limiting the influence of outliers.

# Cap outliers at the IQR boundaries
df["salary_capped"] = df["salary"].clip(lower=lower_bound, upper=upper_bound)
Note

Tree-based models (random forests, gradient boosting) are naturally robust to outliers because they split on rank-order thresholds, not raw values. Linear models, SVMs, and k-nearest neighbors are all sensitive to outliers. Your choice of model should inform how aggressively you handle extreme values.

Encoding Categorical Variables

Machine learning algorithms work with numbers. Categorical columns like "city" and "department" need to be converted into a numerical representation before they can be used as model inputs. The encoding you choose has real consequences for model performance, and the right choice depends on three factors: whether the categories have a natural order, how many unique categories exist (cardinality), and what model you plan to use.

Label Encoding and Ordinal Encoding

Label encoding assigns a unique integer to each category. This works well for ordinal data where the categories have a natural order, such as education level (high school, bachelor's, master's, doctorate). For features used as model inputs (rather than target labels), scikit-learn recommends OrdinalEncoder over LabelEncoder, because OrdinalEncoder is designed for feature columns and integrates properly with ColumnTransformer and pipelines, while LabelEncoder is intended only for encoding the target variable.

from sklearn.preprocessing import OrdinalEncoder

# Define the meaningful order explicitly
education_order = [["high school", "bachelors", "masters", "doctorate"]]
ord_enc = OrdinalEncoder(categories=education_order)

# For demonstration with our dataset (with no categories argument,
# OrdinalEncoder assigns integers in alphabetical order)
dept_enc = OrdinalEncoder()
df["department_encoded"] = dept_enc.fit_transform(df[["department"]]).ravel()

print(df[["department", "department_encoded"]])
# Engineering -> 0.0, HR -> 1.0, Sales -> 2.0
Warning

Integer encoding imposes an artificial order on nominal categories. If "Engineering" becomes 0 and "Sales" becomes 2, a linear model will interpret Sales as "greater than" Engineering. For nominal data with no inherent order, use one-hot encoding instead. Tree-based models, however, can handle ordinal-encoded nominal data without this issue because they split on thresholds rather than interpreting numeric magnitude.

One-Hot Encoding

One-hot encoding creates a new binary column for each category. This avoids the ordering problem entirely because each category is represented by its own independent column.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
city_encoded = ohe.fit_transform(df[["city"]])

# Convert to DataFrame with meaningful column names
city_df = pd.DataFrame(
    city_encoded, columns=ohe.get_feature_names_out(["city"])
)
print(city_df.head())

Pandas also offers a quick shortcut with get_dummies():

# Quick one-hot encoding with pandas
df_encoded = pd.get_dummies(df, columns=["city", "department"], drop_first=True)
print(df_encoded.head())

The drop_first=True parameter removes one of the generated columns per feature to avoid multicollinearity. If a row is not London and not Paris, it must be New York, so that third column carries no additional information. This matters for linear models where multicollinearity causes unstable coefficient estimates, but is irrelevant for tree-based models.

Target Encoding for High-Cardinality Features

When a categorical column has hundreds or thousands of unique values (zip codes, product IDs, city names across an entire country), one-hot encoding becomes impractical because it creates an enormous number of sparse columns. scikit-learn's TargetEncoder (introduced in version 1.3) solves this by replacing each category with a smoothed estimate of the target variable's mean for that category. It uses internal cross-fitting to prevent target leakage during training.

from sklearn.preprocessing import TargetEncoder

# Target encoding for a high-cardinality column
target_enc = TargetEncoder(smooth="auto")

# fit_transform uses cross-fitting to prevent leakage
city_target = target_enc.fit_transform(df[["city"]], df["purchased"])
print(city_target)

This approach was originally described by Daniele Micci-Barreca in a 2001 SIGKDD paper on preprocessing high-cardinality categorical attributes. The scikit-learn documentation notes that TargetEncoder is well-suited for high-cardinality features, while OneHotEncoder works best with low to medium cardinality. For a comparison between these approaches, see the scikit-learn example "Comparing Target Encoder with Other Encoders" in the official documentation at scikit-learn.org.

Feature Scaling

Features measured on different scales can cause problems for algorithms that rely on distance calculations or gradient descent. If "age" ranges from 25 to 50 while "salary" ranges from 48,000 to 85,000, the salary feature will dominate simply because its values are larger. The scikit-learn documentation states that many estimators behave poorly when features do not resemble standard normally distributed data with zero mean and unit variance.

Standardization (Z-Score Scaling)

Standardization transforms each feature to have a mean of 0 and a standard deviation of 1. This is the right choice for algorithms that assume normally distributed data or use regularization, such as logistic regression and support vector machines.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[["age_scaled", "salary_scaled"]] = scaler.fit_transform(
    df[["age", "salary"]]
)

print(df[["age", "age_scaled", "salary", "salary_scaled"]].head())

Normalization (Min-Max Scaling)

Min-max scaling compresses all values into a fixed range, typically 0 to 1. This is useful for algorithms that do not assume any particular distribution, such as neural networks and k-nearest neighbors, and for situations where you need bounded output values.

from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()
df[["age_norm", "salary_norm"]] = minmax.fit_transform(
    df[["age", "salary"]]
)

print(df[["age", "age_norm", "salary", "salary_norm"]].head())
Note

Not every algorithm requires feature scaling. Tree-based models like random forests and gradient boosting are invariant to feature scale because they split on thresholds rather than computing distances. Scaling is critical for SVMs, k-nearest neighbors, logistic regression, and neural networks.

Robust Scaling

When your data contains outliers, robust scaling uses the median and interquartile range instead of the mean and standard deviation, making it resistant to extreme values. The scikit-learn documentation recommends robust scalers when outliers are present in the dataset.

from sklearn.preprocessing import RobustScaler

robust = RobustScaler()
df[["age_robust", "salary_robust"]] = robust.fit_transform(
    df[["age", "salary"]]
)

print(df[["age", "age_robust"]].head())

When to Use Which Scaler

Choosing the wrong scaler is a subtle error that rarely causes a crash but quietly degrades performance. Use StandardScaler when your data is approximately normal and outlier-free. Use MinMaxScaler when you need values in a bounded range (common for neural network inputs or image pixel data). Use RobustScaler when outliers are present and you cannot or should not remove them. If your data has a heavy-tailed or highly skewed distribution, consider applying a log or power transformation (covered next) before scaling.
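A side-by-side sketch makes the difference concrete. With one extreme value in the column, StandardScaler's statistics are dragged by the outlier, MinMaxScaler squeezes the normal values into a sliver of the range, and RobustScaler keeps the bulk of the data spread out:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One extreme salary dominates the column
salaries = np.array([[48_000], [52_000], [55_000], [58_000], [500_000]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(salaries).ravel()
    print(f"{type(scaler).__name__:>14}: {np.round(scaled, 2)}")
```

Notice that RobustScaler maps the median to 0 and leaves the four typical salaries at a usable scale, while the outlier lands far out on its own instead of compressing everything else.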

Feature Transformations

Scaling adjusts the range of your features, but it does not change their shape. Many real-world features have skewed distributions, where a small number of extreme values stretch the tail far to the right (income, home prices, transaction amounts). Skewed features can undermine models that assume symmetry and degrade performance in distance-based algorithms even after scaling.

Log Transformation

The logarithmic transformation compresses large values and spreads small ones, pulling skewed distributions closer to a symmetric, bell-shaped curve. This is one of the simplest and most effective techniques for dealing with right-skewed data.

# Log transformation for right-skewed data
# Add 1 to handle zero values (log(0) is undefined)
df["salary_log"] = np.log1p(df["salary"])

print(df[["salary", "salary_log"]].head())

Power Transformations

scikit-learn provides PowerTransformer with two methods: Yeo-Johnson (handles both positive and negative values) and Box-Cox (requires strictly positive values). Both automatically find the optimal transformation parameter to make the data as close to Gaussian as possible.

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson works with any real-valued data
pt = PowerTransformer(method="yeo-johnson")
df[["salary_power"]] = pt.fit_transform(df[["salary"]])

print(df[["salary", "salary_power"]].head())

Discretization (Binning)

Sometimes a continuous feature is more useful when converted into discrete buckets. Age, for example, may matter to a model not as a precise number but as a life stage. scikit-learn's KBinsDiscretizer automates this with three strategies: uniform intervals, quantile-based bins, or k-means clustering.

from sklearn.preprocessing import KBinsDiscretizer

# Create 4 age buckets based on quantiles
binner = KBinsDiscretizer(
    n_bins=4, encode="ordinal", strategy="quantile"
)
df[["age_binned"]] = binner.fit_transform(df[["age"]])

print(df[["age", "age_binned"]].head())
Pro Tip

Feature transformations can also be embedded inside a pipeline using FunctionTransformer. This lets you apply arbitrary Python functions as pipeline steps, keeping your entire workflow reproducible and leak-free. Example: FunctionTransformer(np.log1p) applies a log transformation within a pipeline.

Splitting Into Training and Test Sets

Before training a model, you must split your data into a training set and a test set. The model learns from the training set and is evaluated on the test set, which it has never seen. This simulates how the model will perform on new, unseen data in production.

from sklearn.model_selection import train_test_split

# Separate features and target
X = df[["age", "salary"]]
y = df["purchased"]

# 80/20 split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} rows")
print(f"Test set:     {X_test.shape[0]} rows")
Warning

Always split your data before fitting any preprocessing transformers. The scikit-learn documentation on common pitfalls warns that calling fit or fit_transform on the full dataset before splitting can produce overly optimistic evaluation scores. This is called data leakage: statistics from the test set (means, scales, category frequencies) leak into the fitted transformers, inflating your metrics and giving you a false sense of model performance. Source: scikit-learn.org, "Common pitfalls and recommended practices."

For imbalanced classification problems where the positive class is rare, use stratify=y in the split to ensure both the training and test sets contain the same proportion of each class. Without stratification, one set may end up with very few examples of the minority class, leading to unreliable evaluation.

# Stratified split preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Building Reusable Pipelines

When your preprocessing involves multiple steps across different column types, manually applying each transformation becomes tedious and error-prone. scikit-learn's Pipeline and ColumnTransformer let you bundle everything into a single, reusable object. This is not just a convenience feature. It is the difference between code that works in a notebook and code that works in production.

Why Pipelines?

Pipelines solve three problems at once. They prevent data leakage by ensuring that transformers are fit only on training data. They make your code cleaner by encapsulating the entire workflow in one object. And they integrate directly with cross-validation and hyperparameter search tools like GridSearchCV, which means every fold of cross-validation correctly re-fits the preprocessing steps from scratch on just the training portion of that fold.

ColumnTransformer for Mixed Data Types

Real datasets contain both numeric and categorical columns that need different transformations. ColumnTransformer lets you define separate processing paths for each column type, then combines the results automatically.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Define column groups
numeric_features = ["age", "salary"]
categorical_features = ["city", "department"]

# Build a pipeline for numeric columns
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Build a pipeline for categorical columns
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine both into a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

Full Pipeline with a Model

The preprocessor plugs directly into a full pipeline that includes the model itself. Calling fit() on this pipeline runs every preprocessing step and then trains the model in a single call.

# Create the full pipeline
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression())
])

# Prepare features and target
X = df[["age", "salary", "city", "department"]]
y = df["purchased"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the entire pipeline on training data
clf.fit(X_train, y_train)

# Evaluate on test data
score = clf.score(X_test, y_test)
print(f"Model accuracy: {score:.2f}")

This single clf object handles imputation, scaling, encoding, and classification. It can be serialized with joblib and deployed to production without any separate preprocessing scripts.

Pro Tip

Use remainder="passthrough" in ColumnTransformer to keep any columns not explicitly assigned to a transformer. The default behavior is remainder="drop", which silently discards unmentioned columns. This is one of the most common sources of bugs in pipeline-based workflows.
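A minimal sketch of the difference, using a small illustrative frame (df_demo) with one column deliberately left out of the transformer list:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({
    "age": [25.0, 30.0, 45.0],
    "salary": [50_000.0, 60_000.0, 70_000.0],
    "purchased": [0, 1, 1],
})

# Default remainder="drop": the unmentioned column vanishes
ct_drop = ColumnTransformer([("num", StandardScaler(), ["age", "salary"])])

# remainder="passthrough": the unmentioned column is kept as-is
ct_keep = ColumnTransformer(
    [("num", StandardScaler(), ["age", "salary"])],
    remainder="passthrough",
)

print(ct_drop.fit_transform(df_demo).shape)  # purchased silently dropped
print(ct_keep.fit_transform(df_demo).shape)  # purchased kept unchanged
```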

Custom Transformers with FunctionTransformer

Not every preprocessing step has a built-in transformer. When you need to apply a custom function, like extracting the day of the week from a date column or creating an interaction feature, scikit-learn's FunctionTransformer lets you wrap arbitrary Python functions into pipeline-compatible objects.

from sklearn.preprocessing import FunctionTransformer

# Wrap a custom function for use in a pipeline
log_transformer = FunctionTransformer(np.log1p, validate=True)

# Use it in a pipeline
numeric_transformer_v2 = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("log", log_transformer),
    ("scaler", StandardScaler())
])

Tuning Pipeline Hyperparameters

Because the pipeline chains named steps together, you can reference any parameter using the double-underscore syntax. This makes it possible to tune both preprocessing and model parameters in a single grid search.

from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10.0]
}

grid_search = GridSearchCV(clf, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score:  {grid_search.best_score_:.2f}")

Saving and Loading Pipelines

A trained pipeline is only useful if you can deploy it. Serializing the entire pipeline with joblib captures every fitted transformer and the trained model in a single file. When loaded in production, calling predict() on new data applies the identical preprocessing chain that was used during training, with no manual steps required.

import joblib

# Save the entire trained pipeline
joblib.dump(clf, "preprocessing_pipeline.pkl")

# Load and use in production
loaded_pipeline = joblib.load("preprocessing_pipeline.pkl")

# new_data must be a DataFrame with the same columns the pipeline was trained on
predictions = loaded_pipeline.predict(new_data)

Key Takeaways

  1. Always inspect before transforming: Use info(), describe(), nunique(), and isnull().sum() to understand the state of your data before applying any transformations.
  2. Fix data types early: Incorrect types cause silent errors downstream. Verify that numeric columns are numeric, dates are datetime objects, and low-cardinality strings are converted to the category dtype.
  3. Understand why data is missing: Rubin's classification (MCAR, MAR, MNAR) should guide your imputation strategy. Simple mean imputation is fine for MCAR; MAR requires methods that leverage feature relationships; MNAR demands domain knowledge.
  4. Detect outliers deliberately: Use IQR for skewed data and Z-scores for normal distributions. Decide based on your domain whether to remove, cap, or keep outliers. Some of the most important signals in your data live in the extremes.
  5. Encode categoricals correctly: Use ordinal encoding only for features with a natural order. Use one-hot encoding for low-cardinality nominal features. Use TargetEncoder for high-cardinality features where one-hot encoding would create too many columns.
  6. Match scaling to your model: SVMs, logistic regression, k-nearest neighbors, and neural networks all benefit from scaled features. Tree-based models do not. Use RobustScaler when outliers are present.
  7. Transform skewed features: Log transforms and power transforms can dramatically improve model performance on right-skewed data by pulling distributions closer to normal.
  8. Split before you fit: Always divide your data into training and test sets before fitting any preprocessor to prevent data leakage. Use stratification for imbalanced classification targets.
  9. Use pipelines for production-ready code: scikit-learn's Pipeline and ColumnTransformer bundle preprocessing and modeling into a single, serializable object that prevents leakage, simplifies deployment, and integrates with hyperparameter tuning.

Data preprocessing is not glamorous, but it is where the real work of data science happens. Clean, well-structured data makes the difference between a model that barely works and one that performs reliably in production. The techniques in this guide are not just routine steps to check off a list. Each one is a decision point where understanding the reasoning behind the technique matters as much as knowing the syntax. Master these techniques, build the habit of using pipelines, and your machine learning projects will start on solid ground every time.
