Python Regression Models: A Complete Guide with Code Examples

Regression is the backbone of predictive modeling. Whether you are forecasting housing prices, estimating sales revenue, or predicting patient medical costs, regression models give you a mathematical framework for understanding how input variables relate to a continuous output. Python's ecosystem of libraries, especially scikit-learn and statsmodels, makes it straightforward to build, evaluate, and compare regression models at every level of complexity.

This guide walks through the regression models you will encounter in practice, from ordinary least squares all the way through ensemble methods. Every section includes working Python code that you can adapt to your own datasets. By the end, you will understand not just how to fit these models, but how to choose between them and evaluate their performance with confidence.

What Is Regression and When Do You Use It?

Regression is a supervised learning technique where the goal is to predict a continuous numeric value based on one or more input features. The model learns the relationship between the independent variables (features) and the dependent variable (target) from labeled training data, then applies that learned relationship to make predictions on new, unseen data.

You use regression whenever your target variable is a number on a continuous scale. Predicting the price of a house based on square footage, number of bedrooms, and location is a regression problem. Predicting whether a customer will churn (yes or no) is a classification problem. The distinction matters because different algorithms and different evaluation metrics apply to each.

Note

Logistic regression, despite its name, is a classification algorithm. It predicts probabilities that map to discrete classes rather than continuous values. This article focuses exclusively on regression models that output continuous numeric predictions.

The standard workflow for any regression task in Python follows a consistent pattern: import the necessary packages, prepare and split the data, create and fit the model, evaluate the results, and make predictions. This structure stays the same regardless of which regression algorithm you choose, which is one of the strengths of scikit-learn's unified API.
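That five-step pattern can be sketched as a minimal skeleton. The data below is a synthetic placeholder, purely to make the snippet runnable end to end:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 1. Prepare the data (synthetic placeholder)
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Create and fit the model
model = LinearRegression().fit(X_train, y_train)

# 4. Evaluate on held-out data
print(f"Test R2: {r2_score(y_test, model.predict(X_test)):.4f}")

# 5. Predict for a new observation
print(model.predict([[0.5, 0.5, 0.5]]))
```

Every model covered below slots into this same skeleton; only step 3 changes.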

Simple and Multiple Linear Regression

Linear regression is the foundation of regression modeling. It assumes a linear relationship between the input features and the target variable. The model finds the best-fitting line (or hyperplane, in the case of multiple features) by minimizing the sum of squared residuals, the differences between the observed values and the values predicted by the linear equation.

Simple linear regression involves a single predictor variable. The equation takes the form y = b0 + b1 * x, where b0 is the y-intercept and b1 is the slope coefficient. Multiple linear regression extends this to any number of predictor variables: y = b0 + b1*x1 + b2*x2 + ... + bn*xn.
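In vector form, the multiple-regression prediction is just a dot product plus the intercept, which is worth seeing once with concrete numbers (the coefficient and feature values below are made up for illustration):

```python
import numpy as np

b0 = 5.0                        # intercept
b = np.array([2.5, -1.0, 0.7])  # coefficients b1, b2, b3 (illustrative)
x = np.array([4.0, 2.0, 10.0])  # one observation's features x1, x2, x3

# y = b0 + b1*x1 + b2*x2 + b3*x3
y_hat = b0 + np.dot(b, x)
print(y_hat)  # 5 + 10 - 2 + 7 = 20.0
```

Fitting a linear model amounts to finding the values of b0 and b that make these predictions match the training data as closely as possible.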

Simple Linear Regression with scikit-learn

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data
np.random.seed(42)
X = np.random.rand(200, 1) * 10  # Single feature
y = 2.5 * X.squeeze() + np.random.randn(200) * 2 + 5

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
print(f"Coefficient (slope): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

The coef_ attribute holds the learned slope coefficient, and intercept_ holds the y-intercept. The R-squared score tells you how much of the variance in the target variable is explained by the model. A value of 1.0 means a perfect fit, while 0.0 means the model does no better than simply predicting the mean of the target.
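R-squared itself is simple to compute from its definition: 1 minus the ratio of the residual sum of squares to the total sum of squares. A quick sketch with made-up arrays confirms the manual calculation matches sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))  # identical values
```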

Multiple Linear Regression with statsmodels

When you need detailed statistical output, including p-values, confidence intervals, and diagnostic statistics, statsmodels is the better choice. It provides the kind of regression summary output familiar to anyone who has worked with statistical software.

import statsmodels.api as sm
import pandas as pd

# Create a DataFrame with multiple features
np.random.seed(42)
n = 300
data = pd.DataFrame({
    'sqft': np.random.randint(800, 3500, n),
    'bedrooms': np.random.randint(1, 6, n),
    'age': np.random.randint(0, 50, n),
})
data['price'] = (
    150 * data['sqft']
    + 20000 * data['bedrooms']
    - 1000 * data['age']
    + np.random.randn(n) * 30000
    + 50000
)

# Define features and target
X = data[['sqft', 'bedrooms', 'age']]
y = data['price']

# Add a constant term for the intercept
X_with_const = sm.add_constant(X)

# Fit the OLS model
ols_model = sm.OLS(y, X_with_const).fit()

# Print the full summary
print(ols_model.summary())

The summary output from statsmodels includes the R-squared value, adjusted R-squared, F-statistic, and individual coefficient p-values. These statistics help you determine which features are statistically significant predictors and whether the overall model is meaningful. A low p-value (typically below 0.05) for a coefficient suggests that the corresponding feature has a statistically significant relationship with the target.

Pro Tip

Use scikit-learn when your primary goal is prediction and model deployment. Use statsmodels when you need statistical inference, hypothesis testing, or a detailed regression summary for reporting purposes. They complement each other well in a typical data science workflow.

Polynomial Regression

Linear regression assumes a straight-line relationship between features and target. When the true relationship is curved, linear regression will underfit the data. Polynomial regression addresses this by introducing polynomial terms (squared, cubed, etc.) of the original features, allowing the model to capture nonlinear patterns while still using the linear regression framework internally.

In scikit-learn, you create polynomial features using PolynomialFeatures and then pass them to a standard LinearRegression model. The combination is typically wrapped in a Pipeline for clean, reproducible code.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Generate nonlinear data
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 6, axis=0)
y = np.sin(X).ravel() + np.random.randn(100) * 0.15

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Compare different polynomial degrees
for degree in [1, 3, 5, 10]:
    poly_model = Pipeline([
        ('poly_features', PolynomialFeatures(degree=degree)),
        ('linear_reg', LinearRegression())
    ])
    poly_model.fit(X_train, y_train)
    y_pred = poly_model.predict(X_test)

    train_score = poly_model.score(X_train, y_train)
    test_score = r2_score(y_test, y_pred)

    print(f"Degree {degree:2d} | "
          f"Train R2: {train_score:.4f} | "
          f"Test R2: {test_score:.4f}")

Choosing the right polynomial degree is critical. A degree that is too low leads to underfitting, where the model cannot capture the true pattern in the data. A degree that is too high leads to overfitting, where the model memorizes noise in the training data and performs poorly on new data. You can typically spot overfitting when the training R-squared is very high but the test R-squared drops significantly.

Warning

High-degree polynomial regression is prone to severe overfitting and numerical instability. In practice, degrees above 4 or 5 rarely improve generalization. If you need to model complex nonlinear relationships, consider regularized models or tree-based methods instead.

Regularized Regression: Ridge, Lasso, and Elastic Net

When you have many features, or when features are highly correlated with each other (multicollinearity), standard linear regression can produce unstable coefficient estimates. Regularization addresses this by adding a penalty term to the loss function that discourages large coefficient values. This constraint forces the model to find simpler solutions that generalize better to new data.

Ridge Regression (L2 Regularization)

Ridge regression adds the sum of squared coefficients, multiplied by a tuning parameter alpha, to the ordinary least squares loss function. This penalty shrinks coefficient values toward zero but never forces them all the way to zero. Ridge is especially effective when you have many features that each contribute a small amount to the prediction.

from sklearn.linear_model import Ridge, RidgeCV

# Generate data with correlated features
np.random.seed(42)
n_samples = 200
X = np.random.randn(n_samples, 10)

# Add correlated features
X[:, 5] = X[:, 0] + np.random.randn(n_samples) * 0.1
X[:, 6] = X[:, 1] + np.random.randn(n_samples) * 0.1

y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(n_samples) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Use cross-validation to find the best alpha
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)

print(f"Best alpha: {ridge_cv.alpha_}")
print(f"R-squared: {ridge_cv.score(X_test, y_test):.4f}")
print(f"Coefficients: {ridge_cv.coef_.round(3)}")

Lasso Regression (L1 Regularization)

Lasso regression uses the sum of absolute coefficient values as its penalty term. The key difference from Ridge is that Lasso can drive coefficients all the way to zero, effectively removing features from the model. This makes Lasso useful for feature selection in addition to regularization.

from sklearn.linear_model import Lasso, LassoCV

# Use cross-validation to find the best alpha
lasso_cv = LassoCV(alphas=None, cv=5, random_state=42)
lasso_cv.fit(X_train, y_train)

print(f"Best alpha: {lasso_cv.alpha_:.6f}")
print(f"R-squared: {lasso_cv.score(X_test, y_test):.4f}")
print(f"Coefficients: {lasso_cv.coef_.round(3)}")
print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)} out of {len(lasso_cv.coef_)}")

When you examine the Lasso output, you will notice that some coefficients are exactly 0.0. The corresponding features have been removed from the model entirely. This is useful for building interpretable models and for identifying which features matter in your dataset.
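A quick standalone sketch makes the selection behavior concrete. The feature names and data here are invented; only features f0 and f2 actually drive the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']
lasso = Lasso(alpha=0.1).fit(X, y)

# Coefficients for the irrelevant features are driven exactly to zero
selected = [name for name, c in zip(feature_names, lasso.coef_) if c != 0]
print(dict(zip(feature_names, lasso.coef_.round(3))))
print("Selected:", selected)
```

Only f0 and f2 survive; the other three coefficients are exactly 0.0.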

Elastic Net (L1 + L2 Regularization)

Elastic Net combines both L1 and L2 penalties. It inherits Lasso's ability to perform feature selection while retaining Ridge's stability when dealing with correlated features. The l1_ratio parameter controls the balance between the two penalties: a value of 1.0 is pure Lasso, 0.0 is pure Ridge, and values in between give you a mix.

from sklearn.linear_model import ElasticNet, ElasticNetCV

# Search over both alpha and l1_ratio
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],
    alphas=None,
    cv=5,
    random_state=42
)
elastic_cv.fit(X_train, y_train)

print(f"Best alpha: {elastic_cv.alpha_:.6f}")
print(f"Best l1_ratio: {elastic_cv.l1_ratio_}")
print(f"R-squared: {elastic_cv.score(X_test, y_test):.4f}")
print(f"Non-zero coefficients: {np.sum(elastic_cv.coef_ != 0)}")
print(f"Non-zero coefficients: {np.sum(elastic_cv.coef_ != 0)}")

Pro Tip

If you are unsure which regularization method to use, start with Elastic Net. It provides a good default by combining the strengths of both Ridge and Lasso. The cross-validated variants (RidgeCV, LassoCV, ElasticNetCV) automatically select the best hyperparameters for you.

Decision Tree and Random Forest Regression

Tree-based models take a fundamentally different approach to regression. Instead of fitting a mathematical equation, they recursively partition the feature space into regions and assign each region a predicted value, typically the mean of the training samples that fall in it. This makes them naturally able to capture nonlinear relationships and feature interactions without any manual feature engineering.

Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor

# Fit a decision tree
tree_model = DecisionTreeRegressor(
    max_depth=5,
    min_samples_leaf=10,
    random_state=42
)
tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)
print(f"Decision Tree R-squared: {r2_score(y_test, y_pred_tree):.4f}")
print(f"Decision Tree RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_tree)):.4f}")

# Feature importance
for i, importance in enumerate(tree_model.feature_importances_):
    if importance > 0.01:
        print(f"  Feature {i}: {importance:.4f}")

The max_depth and min_samples_leaf parameters control the complexity of the tree. Without these constraints, a decision tree will grow until every leaf contains a single training sample, resulting in severe overfitting. Setting max_depth=5 limits the tree to five levels of splits, and min_samples_leaf=10 requires at least ten training samples in every terminal node.
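A quick comparison shows why those constraints matter. In this standalone sketch (synthetic sine data), the unconstrained tree fits the training set perfectly but generalizes worse than the pruned one:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeRegressor(
    max_depth=5, min_samples_leaf=10, random_state=42
).fit(X_train, y_train)

# The unconstrained tree memorizes the training data (train R2 = 1.0)
print("Unconstrained: train", round(full_tree.score(X_train, y_train), 3),
      "| test", round(full_tree.score(X_test, y_test), 3))
print("Pruned:        train", round(pruned_tree.score(X_train, y_train), 3),
      "| test", round(pruned_tree.score(X_test, y_test), 3))
```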

Random Forest Regression

Random forests build many decision trees on random subsets of the data and average their predictions. This ensemble approach dramatically reduces overfitting compared to a single tree and typically produces more accurate and stable predictions.

from sklearn.ensemble import RandomForestRegressor

# Fit a random forest
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
print(f"Random Forest R-squared: {r2_score(y_test, y_pred_rf):.4f}")
print(f"Random Forest RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.4f}")

# Feature importance from the ensemble
importances = rf_model.feature_importances_
for i, imp in enumerate(importances):
    if imp > 0.01:
        print(f"  Feature {i}: {imp:.4f}")

The n_estimators parameter controls how many trees are in the forest. More trees generally improve performance up to a point, but also increase computation time. The n_jobs=-1 argument tells scikit-learn to use all available CPU cores for parallel training.
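Under the hood, the forest's regression prediction is simply the average of its individual trees' predictions. A standalone sketch (synthetic data) verifies this through the estimators_ attribute:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

x_new = X[:1]  # a single sample
per_tree = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])

# The ensemble prediction equals the mean over the 25 trees
print(rf.predict(x_new)[0], per_tree.mean())
```

The individual trees disagree with each other (each saw a different bootstrap sample), but averaging them cancels out much of that variance, which is where the ensemble's stability comes from.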

Evaluating and Comparing Regression Models

Choosing the right model requires consistent evaluation using appropriate metrics. The three metrics you will use in nearly every regression project are R-squared, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

R-squared (R2) measures how much of the variance in the target is explained by the model. Values closer to 1.0 indicate a better fit. However, R-squared alone can be misleading, especially when comparing models with different numbers of features.

Mean Absolute Error (MAE) gives you the average magnitude of prediction errors in the same units as the target variable. It is easy to interpret: an MAE of 5000 on a house price model means the predictions are off by $5,000 on average.

Root Mean Squared Error (RMSE) penalizes large errors more heavily than MAE because it squares the errors before averaging. If your use case treats large prediction errors as especially problematic, RMSE is the more informative metric.
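The difference between the two error metrics is easy to see computed by hand (the residual values below are made up):

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 230.0, 245.0])

errors = y_true - y_pred             # residuals: [-10, 10, -30, 5]
mae = np.mean(np.abs(errors))        # (10 + 10 + 30 + 5) / 4 = 13.75
rmse = np.sqrt(np.mean(errors**2))   # sqrt((100 + 100 + 900 + 25) / 4) ~ 16.77

print(mae, rmse)  # RMSE exceeds MAE because the single large error is squared
```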

from sklearn.metrics import mean_absolute_error

# Compare all models side by side
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (alpha=1.0)': Ridge(alpha=1.0),
    'Lasso (alpha=0.01)': Lasso(alpha=0.01),
    'Elastic Net': ElasticNet(alpha=0.01, l1_ratio=0.5),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'Random Forest': RandomForestRegressor(
        n_estimators=100, max_depth=10, random_state=42, n_jobs=-1
    ),
}

print(f"{'Model':<25} {'R2':>8} {'MAE':>8} {'RMSE':>8}")
print("-" * 51)

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{name:<25} {r2:>8.4f} {mae:>8.4f} {rmse:>8.4f}")
print(f"{name:<25} {r2:>8.4f} {mae:>8.4f} {rmse:>8.4f}")

Note

Always evaluate on a held-out test set that the model has never seen during training. For more robust estimates, use cross-validation with cross_val_score from scikit-learn, which rotates through multiple train/test splits and averages the results.

Cross-Validation for Robust Evaluation

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation for each model
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=5, scoring='r2'
    )
    print(f"{name:<25} "
          f"Mean R2: {scores.mean():.4f} "
          f"(+/- {scores.std():.4f})")

Cross-validation gives you a more reliable performance estimate than a single train/test split. The standard deviation of the scores tells you how much the model's performance varies across different subsets of the data. A high standard deviation suggests the model is sensitive to which specific data points end up in the training set.

Complete Workflow Example

Here is a complete, end-to-end regression workflow that ties together everything covered in this article. It uses scikit-learn's built-in diabetes dataset, which contains ten baseline features measured from 442 diabetes patients, with a quantitative measure of disease progression one year after baseline as the target.

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet
)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Load data
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

print(f"Dataset shape: {X.shape}")
print(f"Target range: {y.min():.0f} to {y.max():.0f}\n")

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define models with preprocessing pipelines
pipelines = {
    'Linear Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression())
    ]),
    'Ridge': Pipeline([
        ('scaler', StandardScaler()),
        ('model', Ridge(alpha=1.0))
    ]),
    'Lasso': Pipeline([
        ('scaler', StandardScaler()),
        ('model', Lasso(alpha=0.1))
    ]),
    'Elastic Net': Pipeline([
        ('scaler', StandardScaler()),
        ('model', ElasticNet(alpha=0.1, l1_ratio=0.5))
    ]),
    'Random Forest': Pipeline([
        ('model', RandomForestRegressor(
            n_estimators=200, max_depth=8,
            min_samples_leaf=5, random_state=42,
            n_jobs=-1
        ))
    ]),
}

# Evaluate each model
print(f"{'Model':<20} {'CV R2':>8} {'Test R2':>9} "
      f"{'MAE':>8} {'RMSE':>8}")
print("-" * 57)

best_score = -np.inf
best_model_name = None

for name, pipeline in pipelines.items():
    # Cross-validation score
    cv_scores = cross_val_score(
        pipeline, X_train, y_train, cv=5, scoring='r2'
    )

    # Fit and evaluate on test set
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    test_r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    print(f"{name:<20} {cv_scores.mean():>8.4f} {test_r2:>9.4f} "
          f"{mae:>8.2f} {rmse:>8.2f}")

    if cv_scores.mean() > best_score:
        best_score = cv_scores.mean()
        best_model_name = name

print(f"\nBest model by CV R2: {best_model_name} ({best_score:.4f})")

This workflow standardizes features with StandardScaler inside a pipeline, which prevents data leakage by ensuring the scaler is fit only on training data. The Random Forest pipeline skips scaling because tree-based models are invariant to feature scaling. Cross-validation provides reliable performance estimates, and the final test set evaluation confirms how well the selected model generalizes.
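You can confirm the no-leakage behavior directly: after fitting, the pipeline's scaler holds statistics computed from the training split alone. A standalone sketch with synthetic data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = X[:, 0] + rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0)),
]).fit(X_train, y_train)

# The scaler's means come from X_train alone; test rows are transformed
# with these statistics but never used to refit them
print(pipe.named_steps['scaler'].mean_)
print(X_train.mean(axis=0))  # identical
```

During cross-validation, cross_val_score clones and refits the whole pipeline on each training fold, so the same guarantee holds fold by fold.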

Key Takeaways

  1. Start with linear regression. It is simple, interpretable, and often performs surprisingly well. Use it as your baseline before trying more complex models.
  2. Use regularization when you have many features. Ridge, Lasso, and Elastic Net prevent overfitting and handle multicollinearity. Lasso doubles as a feature selection tool by zeroing out unimportant coefficients.
  3. Polynomial regression handles curves, but use it carefully. Degrees above 4 or 5 frequently overfit. For complex nonlinear patterns, tree-based models are usually a better choice.
  4. Random forests are a strong default for nonlinear data. They resist overfitting, handle interactions automatically, and require minimal feature engineering. They trade interpretability for prediction accuracy.
  5. Always evaluate with cross-validation. A single train/test split can give misleading results. Cross-validation provides a more stable estimate of how your model will perform on unseen data.
  6. Choose metrics that match your problem. Use MAE when all errors matter equally. Use RMSE when large errors are especially costly. Use R-squared for a quick sense of overall model quality, but do not rely on it alone.

Regression modeling in Python is accessible, well-documented, and supported by a mature ecosystem of tools. The models covered in this guide, from linear regression through random forests, will handle the vast majority of real-world regression tasks. Master the workflow of fitting, evaluating, and comparing models, and you will be equipped to tackle any prediction problem that involves a continuous target variable.
