Supervised learning is the backbone of practical machine learning. You give a model labeled data, it learns the patterns, and then it makes predictions on new inputs it has never seen before. Python's scikit-learn library makes this entire workflow remarkably straightforward, whether you are predicting house prices or classifying email as spam.
This article walks through the supervised learning models available in Python's scikit-learn library. Each section covers what a model does, when to use it, and how to implement it with working code. By the end, you will have a practical understanding of the tools you need to tackle classification and regression problems in your own projects.
What Is Supervised Learning
Supervised learning is a category of machine learning where you train a model using data that already has known outcomes. Each data point in your training set consists of input features (the variables you use to make predictions) and a target label (the answer you want the model to learn). The model studies the relationship between the features and the labels, then applies what it learned to predict outcomes on data it has not encountered before.
Think of it like a student studying with answer keys. The student reviews hundreds of solved problems, identifies the patterns, and then takes the exam on new questions. The quality of the student's performance depends on the quality and quantity of practice problems, and the same principle applies to supervised learning models.
Supervised learning splits into two main tasks: classification, where the model predicts a category, and regression, where the model predicts a continuous numerical value. Every model covered in this article handles one or both of these tasks.
Supervised learning stands in contrast to unsupervised learning, where the model works with unlabeled data and tries to discover hidden structure on its own. Clustering and dimensionality reduction are common unsupervised tasks.
Classification vs. Regression
Before choosing a model, you need to determine whether your problem is a classification task or a regression task. This decision shapes everything from the algorithm you select to the metrics you use to evaluate performance.
Classification predicts discrete categories. The output is a label, not a number. Examples include determining whether an email is spam or legitimate, identifying whether a medical scan shows a tumor or healthy tissue, and categorizing customer reviews as positive, neutral, or negative. When there are only two possible outcomes, the task is called binary classification. When there are three or more possible outcomes, it is called multiclass classification.
Regression predicts continuous numerical values. The output sits on a scale and can take any value within a range. Examples include forecasting the sale price of a house based on square footage and neighborhood, predicting how many units of a product will sell next quarter, and estimating a patient's blood pressure based on age, weight, and lifestyle factors.
Some algorithms, like decision trees and random forests, handle both classification and regression. Others, like logistic regression, are designed specifically for classification despite having "regression" in the name.
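To make that distinction concrete, here is a minimal sketch (using scikit-learn's toy data generators, chosen for illustration) showing the same tree algorithm applied to both task types simply by picking the matching estimator class:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the target is a discrete label (here, 0 or 1)
X_cls, y_cls = make_classification(n_samples=100, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))   # predictions are class labels

# Regression: the target is a continuous value
X_reg, y_reg = make_regression(n_samples=100, random_state=0)
reg = DecisionTreeRegressor(random_state=0).fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))   # predictions are real numbers
```

The two estimators expose the same interface; only the type of output changes.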
The Scikit-Learn Workflow
Scikit-learn follows a consistent pattern across every supervised learning model. Once you learn this workflow, you can apply it to any algorithm in the library. The steps are always the same: import the model class, create an instance, fit it on training data, and use it to make predictions.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Split data into training and test sets
# (X holds your feature matrix, y holds the labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create a model instance
model = LinearRegression()

# Fit the model on training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
This consistency is one of scikit-learn's greatest strengths. You can swap LinearRegression() for RandomForestClassifier() or SVC() and the rest of the code stays nearly identical. The fit(), predict(), and score() methods work the same way across every estimator.
Always split your data into training and test sets before fitting a model. Training and evaluating on the same data gives you an overly optimistic estimate of how well your model will perform on new, unseen data.
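The interchangeability of estimators is easy to demonstrate. The sketch below (on synthetic data from make_classification, an assumption for illustration) scores three different models with exactly the same fit and score calls:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy data stands in for any real dataset
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for model in (LogisticRegression(max_iter=500),
              RandomForestClassifier(random_state=42),
              SVC()):
    model.fit(X_train, y_train)            # identical call for every estimator
    scores[type(model).__name__] = model.score(X_test, y_test)
    print(f"{type(model).__name__}: {scores[type(model).__name__]:.3f}")
```

Swapping algorithms becomes a one-line change, which makes it cheap to benchmark several candidates before committing to one.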
Linear Regression
Linear regression is the simplest supervised learning model for regression tasks. It assumes a linear relationship between the input features and the target variable, fitting a straight line (or a hyperplane in higher dimensions) that minimizes the sum of squared differences between predicted and actual values.
The model works by finding the coefficients (one slope per feature) and the intercept that together define the best-fitting line. Scikit-learn uses ordinary least squares by default, which minimizes the residual sum of squares between observed and predicted values.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate sample regression data
X, y = make_regression(
    n_samples=200, n_features=3,
    noise=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Inspect the learned parameters
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.4f}")
print(f"R-squared: {model.score(X_test, y_test):.4f}")
Linear regression works well when the relationship between features and target is roughly linear, the features are not highly correlated with each other, and the dataset is free of significant outliers. When these assumptions break down, consider regularized variants like Ridge (L2 penalty) or Lasso (L1 penalty), which add a penalty term to prevent overfitting and handle multicollinearity.
from sklearn.linear_model import Ridge, Lasso
# Ridge regression with L2 regularization
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge R-squared: {ridge.score(X_test, y_test):.4f}")
# Lasso regression with L1 regularization
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Lasso R-squared: {lasso.score(X_test, y_test):.4f}")
Logistic Regression
Despite its name, logistic regression is a classification algorithm, not a regression one. It predicts the probability that a given input belongs to a particular class by applying the logistic (sigmoid) function to a linear combination of the input features. The sigmoid function maps any real-valued number to a value between 0 and 1, which is then interpreted as a probability.
If the predicted probability exceeds a threshold (typically 0.5), the model assigns the positive class. Otherwise, it assigns the negative class. Logistic regression extends naturally to multiclass problems using strategies like one-vs-rest or multinomial regression.
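The sigmoid mapping itself is one line of NumPy. This standalone sketch (not part of scikit-learn's API) shows how raw linear scores are squashed into probabilities and then thresholded at 0.5:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # linear scores w.x + b
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)          # threshold at 0.5
print(probs.round(3))   # [0.018 0.269 0.5   0.731 0.982]
print(labels)           # [0 0 1 1 1]
```

Large negative scores map close to 0, large positive scores close to 1, and a score of exactly zero sits right on the 0.5 decision boundary.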
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train logistic regression; with the lbfgs solver, multiclass data
# is handled with a multinomial loss by default (the old multi_class
# parameter is deprecated and no longer needed)
model = LogisticRegression(max_iter=200, solver='lbfgs')
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred,
                            target_names=iris.target_names))
Logistic regression is a strong starting point for classification tasks. It trains quickly, produces interpretable coefficients that indicate feature importance, and handles both binary and multiclass problems. It performs best when classes are roughly linearly separable in the feature space.
Decision Trees
A decision tree learns by repeatedly splitting the data based on feature values that best separate the target classes (for classification) or reduce prediction error (for regression). The result is a tree-like structure where each internal node represents a test on a feature, each branch represents the outcome of that test, and each leaf node holds a prediction.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a decision tree with controlled depth
tree = DecisionTreeClassifier(
    max_depth=4,
    min_samples_split=5,
    random_state=42
)
tree.fit(X_train, y_train)
print(f"Training accuracy: {tree.score(X_train, y_train):.4f}")
print(f"Test accuracy: {tree.score(X_test, y_test):.4f}")

# Show feature importances above a small threshold
for name, importance in zip(wine.feature_names,
                            tree.feature_importances_):
    if importance > 0.05:
        print(f"  {name}: {importance:.4f}")
Decision trees are easy to understand and visualize. They handle both numerical and categorical features, require minimal data preprocessing, and naturally capture non-linear relationships. However, they are prone to overfitting, especially when allowed to grow without constraints. Setting parameters like max_depth, min_samples_split, and min_samples_leaf helps control tree complexity.
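That interpretability is easy to see with scikit-learn's export_text helper, which prints a fitted tree's rules as indented text. A minimal sketch on the built-in Iris dataset (used here so the example is self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned decision rules as a readable, indented outline
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each line shows a feature threshold test, and each leaf reports the predicted class, so the entire model can be read and audited by hand.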
An unconstrained decision tree will memorize the training data perfectly, achieving 100% training accuracy while performing poorly on unseen data. Always set depth limits or pruning parameters to prevent this.
Random Forests
A random forest addresses the overfitting problem of individual decision trees by building an ensemble of many trees and combining their predictions. Each tree in the forest is trained on a random subset of the data (bootstrap sampling) and considers only a random subset of features at each split. This randomness ensures that the individual trees are diverse, which makes the combined prediction more robust and accurate.
For classification, the forest takes a majority vote among all trees. For regression, it averages the predictions.
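The averaging behavior can be verified directly. In this small sketch on synthetic data, the fitted trees exposed by the estimators_ attribute are averaged by hand and compared with the forest's own output:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Average the per-tree predictions manually ...
per_tree = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
manual_mean = per_tree.mean(axis=0)

# ... and compare with the forest's prediction
print(np.allclose(manual_mean, rf.predict(X[:5])))  # True
```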
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
# Load breast cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a random forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1  # Use all available CPU cores
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Top 5 most important features
importances = rf.feature_importances_
indices = importances.argsort()[::-1][:5]
for i in indices:
    print(f"  {cancer.feature_names[i]}: {importances[i]:.4f}")
Random forests are one of the best general-purpose algorithms for tabular data. They resist overfitting better than single decision trees, handle missing values and mixed feature types gracefully, and provide reliable feature importance rankings. The main tradeoff is computational cost, as training hundreds of trees takes more time and memory than training a single model.
Support Vector Machines
Support Vector Machines (SVMs) find the hyperplane that separates classes with the maximum margin. The margin is the distance between the hyperplane and the nearest data points from each class, called support vectors. By maximizing this margin, SVMs tend to generalize well to unseen data.
SVMs become especially powerful with kernel functions, which project the data into higher-dimensional spaces where classes that were not linearly separable in the original space become separable. The radial basis function (RBF) kernel is the default and handles many non-linear classification tasks effectively.
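The effect of the kernel is easiest to see on data that a straight line cannot separate. This sketch (using make_circles, chosen for illustration) compares a linear kernel with the RBF kernel on two concentric rings:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)
print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

The linear kernel hovers near chance on this data, while the RBF kernel separates the rings almost perfectly, because the kernel implicitly lifts the points into a space where a separating hyperplane exists.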
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
# Generate sample data
X, y = make_classification(
    n_samples=500, n_features=20,
    n_informative=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVMs perform best with scaled features
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))
])
svm_pipeline.fit(X_train, y_train)
print(f"Accuracy: {svm_pipeline.score(X_test, y_test):.4f}")
SVMs excel in high-dimensional spaces and perform well when the number of features exceeds the number of samples. They are memory-efficient because the decision function depends only on the support vectors, not the entire training set. However, SVMs are sensitive to feature scaling, so always standardize or normalize your data before training. They also become slow to train on very large datasets, so consider LinearSVC or other alternatives when working with more than a few tens of thousands of samples.
Use Pipeline to bundle preprocessing steps (like scaling) with your model. This prevents data leakage by ensuring that the scaler is fitted only on the training data and then applied consistently to the test data.
Gradient Boosting
Gradient boosting builds an ensemble of weak learners (usually shallow decision trees) sequentially rather than in parallel. Each new tree focuses on the errors made by the previous trees, gradually reducing the overall prediction error. This iterative correction process makes gradient boosting one of the highest-performing algorithms for structured data.
Scikit-learn offers two gradient boosting implementations: GradientBoostingClassifier/GradientBoostingRegressor and the more recent HistGradientBoostingClassifier/HistGradientBoostingRegressor. The histogram-based versions are significantly faster on larger datasets because they bin continuous features into discrete intervals, reducing the number of split candidates the algorithm needs to evaluate.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
# Generate a larger dataset
X, y = make_classification(
    n_samples=5000, n_features=30,
    n_informative=15, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Histogram-based gradient boosting
hgb = HistGradientBoostingClassifier(
    max_iter=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)
hgb.fit(X_train, y_train)
y_pred = hgb.predict(X_test)
y_proba = hgb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")
The histogram-based variant also has native support for categorical features, which means you can pass categorical columns directly without encoding them first. Set categorical_features="from_dtype" and ensure your categorical columns use a pandas Categorical dtype.
Gradient boosting requires more careful tuning than random forests. The key hyperparameters are learning_rate (how much each tree contributes), max_iter or n_estimators (total number of trees), and max_depth (complexity of each tree). A lower learning rate generally requires more trees but produces better results.
Model Evaluation and Selection
Choosing the right model is only half the battle. You also need reliable methods to measure how well your model performs and to tune its hyperparameters. Scikit-learn provides a rich set of tools for both.
Cross-Validation
A single train-test split can give misleading results if the split happens to be unusually favorable or unfavorable. Cross-validation addresses this by splitting the data into multiple folds, training and testing on each combination, and averaging the results.
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation on a random forest
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=5, scoring='accuracy'
)
print(f"Mean accuracy: {scores.mean():.4f}")
print(f"Standard deviation: {scores.std():.4f}")
Hyperparameter Tuning
Every model has hyperparameters that control its behavior but are not learned from the data. GridSearchCV exhaustively searches through a specified grid of parameter values, while RandomizedSearchCV samples random combinations, which is more efficient when the search space is large.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Define parameter distributions to sample from
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

# Randomized search with cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
search.fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
print(f"Test score: {search.score(X_test, y_test):.4f}")
Key Metrics
Accuracy alone is often insufficient, especially when classes are imbalanced. For classification, consider precision (how many positive predictions were correct), recall (how many actual positives were caught), F1-score (the harmonic mean of precision and recall), and ROC AUC (the area under the receiver operating characteristic curve). For regression, common metrics include mean squared error, mean absolute error, and R-squared.
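A small sketch (on hand-built labels, chosen to make the imbalance obvious) shows why accuracy alone misleads: a model that always predicts the majority class scores 90% accuracy while catching zero actual positives.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Imbalanced ground truth: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A lazy model that predicts the majority class every time
y_pred = np.zeros(100, dtype=int)

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")                    # 0.90
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")                      # 0.00
print(f"F1-score:  {f1_score(y_true, y_pred, zero_division=0):.2f}")         # 0.00
```

Recall exposes the failure immediately, which is why metrics should be chosen to match the cost of the errors your application actually cares about.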
What Is New in Scikit-Learn 1.8
Scikit-learn version 1.8, released in December 2025, introduced several improvements that are relevant to supervised learning workflows.
The headline feature is expanded Array API support. This allows scikit-learn estimators and functions to work directly with PyTorch tensors and CuPy arrays, which means computations can run on GPUs without rewriting your code for a different library. Estimators that gained Array API support in this release include StandardScaler, PolynomialFeatures, RidgeCV, RidgeClassifierCV, GaussianMixture, CalibratedClassifierCV, and GaussianNB. Several metrics functions, including roc_curve, precision_recall_curve, and confusion_matrix, also received Array API support.
# Example: Running RidgeClassifierCV on GPU with PyTorch
import torch
from sklearn.linear_model import RidgeClassifierCV
from sklearn import config_context
# Move data to the GPU as PyTorch tensors
X_gpu = torch.tensor(X_train, dtype=torch.float32, device="cuda")
y_gpu = torch.tensor(y_train, dtype=torch.float32, device="cuda")

# Enable Array API dispatch so the estimator operates on the tensors directly
with config_context(array_api_dispatch=True):
    ridge = RidgeClassifierCV(alphas=[0.1, 1.0, 10.0])
    ridge.fit(X_gpu, y_gpu)
Version 1.8 also introduced temperature scaling as a probability calibration method in CalibratedClassifierCV. This technique is particularly effective for multiclass problems because it achieves calibrated probabilities with a single free parameter, unlike other methods that use a one-vs-rest scheme requiring additional parameters for each class.
The release also added support for free-threaded CPython (Python 3.14), with free-threaded wheels available for all supported platforms. This opens the door to more efficient multi-core CPU usage by allowing thread-based parallelism without the Global Interpreter Lock.
Key Takeaways
- Start simple, then iterate: Begin with a straightforward model like logistic regression or a basic decision tree to establish a baseline. Move to more complex models like random forests or gradient boosting only when the simpler model falls short.
- Scikit-learn's consistent API is your friend: Every supervised learning model in the library uses the same fit(), predict(), and score() interface. Learn the pattern once and apply it everywhere.
- Always validate properly: Use cross-validation instead of relying on a single train-test split. Tune hyperparameters with GridSearchCV or RandomizedSearchCV rather than guessing.
- Match the model to the problem: Choose classification models for discrete outcomes and regression models for continuous values. Consider your dataset size, feature count, and interpretability requirements when selecting an algorithm.
- Preprocessing matters: Scale your features for SVMs and logistic regression. Use Pipeline to chain preprocessing and modeling steps together, preventing data leakage and keeping your workflow clean.
- GPU acceleration is here: With scikit-learn 1.8's expanded Array API support, you can now run a growing set of estimators on GPUs using PyTorch or CuPy tensors, with the potential for significant speedups on large datasets.
Supervised learning in Python has never been more accessible. Scikit-learn gives you a complete toolkit for building, evaluating, and tuning models across a wide range of problem types. The library continues to evolve with GPU support, improved performance, and new features, but the core workflow remains the same. Pick a model, fit it, evaluate it, and improve from there.