Logistic Regression is one of the foundational algorithms in machine learning, specifically designed for binary classification problems where the outcome falls into one of two categories. Despite its name containing the word "regression," it is a classification algorithm at its core. This article walks through the theory behind Logistic Regression and builds a complete, working example in Python using scikit-learn.
Binary classification is everywhere. Will a customer churn or stay? Is an email spam or legitimate? Does a patient have a disease or not? Logistic Regression gives you a reliable, interpretable way to answer these yes-or-no questions with data. It remains one of the go-to algorithms because it is fast, easy to interpret, and performs well on linearly separable data.
What Is Logistic Regression
Logistic Regression is a supervised learning algorithm that predicts the probability of an observation belonging to a particular class. While linear regression outputs a continuous value, Logistic Regression maps that output through a special function (the sigmoid function) to produce a value between 0 and 1. That value represents a probability, and a threshold (typically 0.5) determines the final class label.
In a binary classification scenario, the model learns a decision boundary that separates the two classes. The algorithm works by finding the weights of a linear decision boundary (a line in two dimensions, a hyperplane in more) that maximize the likelihood of the observed training labels. This optimization process is called maximum likelihood estimation.
Logistic Regression assumes a linear relationship between the input features and the log-odds of the target variable. If the relationship between features and the target is highly non-linear, consider tree-based models or neural networks instead.
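The log-odds linearity can be illustrated with a tiny numeric sketch. The weights and feature values below are made up purely for illustration; the point is that recovering the log-odds from the predicted probability gives back exactly the linear score.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical fitted weights for two features (illustrative values only)
w0, w1, w2 = -1.0, 0.8, -0.5
x1, x2 = 2.0, 1.0

z = w0 + w1 * x1 + w2 * x2      # linear score
p = sigmoid(z)                  # predicted probability
log_odds = np.log(p / (1 - p))  # invert the sigmoid

# The log-odds equal the linear score exactly: the model is linear
# in log-odds space, not in probability space.
print(z, log_odds)
```

This is why a one-unit increase in a feature multiplies the odds by a constant factor (e^w), regardless of the feature's current value.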
The Sigmoid Function
The sigmoid function is the mathematical engine behind Logistic Regression. It takes any real-valued number and maps it to a value between 0 and 1. The formula is:
sigma(z) = 1 / (1 + e^(-z))
Where z is the linear combination of the input features and their learned weights: z = w0 + w1*x1 + w2*x2 + ... + wn*xn. When z is a large positive number, the sigmoid output approaches 1. When z is a large negative number, the output approaches 0. At z = 0, the output is exactly 0.5.
Here is how to implement and visualize the sigmoid function in Python:
import numpy as np
def sigmoid(z):
"""Compute the sigmoid of z."""
return 1 / (1 + np.exp(-z))
# Generate a range of values
z_values = np.linspace(-10, 10, 200)
probabilities = sigmoid(z_values)
# When z = 0, the probability is exactly 0.5
print(f"sigmoid(0) = {sigmoid(0)}") # 0.5
print(f"sigmoid(5) = {sigmoid(5):.4f}") # 0.9933
print(f"sigmoid(-5) = {sigmoid(-5):.4f}") # 0.0067
Preparing the Data
Before training a Logistic Regression model, the data needs to be properly prepared. This involves splitting the dataset into training and testing sets, and scaling the features so they all operate on a similar range. Feature scaling is particularly important for Logistic Regression because the algorithm uses gradient-based optimization, and features on wildly different scales can slow convergence or bias the model toward features with larger magnitudes.
Scikit-learn provides convenient tools for both of these steps. The train_test_split function handles the data split, and StandardScaler standardizes features by removing the mean and scaling to unit variance.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# X = your feature matrix, y = your target labels (0 or 1)
# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Always use stratify=y in train_test_split for classification tasks. This ensures both the training and testing sets maintain the same proportion of each class as the original dataset, which is critical when dealing with imbalanced classes.
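A small sketch on toy data (the 90/10 labels below are made up for illustration) shows what stratification guarantees: both splits preserve the original class ratio exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 zeros, 10 ones (illustrative data)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the original 90/10 class ratio
print(np.bincount(y_tr))  # [72  8]
print(np.bincount(y_te))  # [18  2]
```

Without stratify, an unlucky random split could leave only a handful of minority-class samples in the test set, making evaluation metrics unstable.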
Training the Model
With scikit-learn (version 1.8 as of this writing), training a Logistic Regression model requires just a few lines of code. The LogisticRegression class lives in sklearn.linear_model and uses the lbfgs solver by default, which works well for small to medium-sized datasets.
from sklearn.linear_model import LogisticRegression
# Create and train the model
model = LogisticRegression(random_state=42, max_iter=200)
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Get probability estimates
y_proba = model.predict_proba(X_test_scaled)
The predict method returns class labels (0 or 1), while predict_proba returns the probability for each class. The probability array has two columns: column 0 holds the probability of class 0, and column 1 holds the probability of class 1.
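Because predict_proba exposes the raw probabilities, you are not locked into the default 0.5 cutoff. The sketch below (with made-up probabilities) shows how to apply a custom threshold, which is useful when false negatives are more costly than false positives.

```python
import numpy as np

# predict_proba-style output for five samples (made-up values);
# column 0 is P(class 0), column 1 is P(class 1)
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.45, 0.55],
    [0.20, 0.80],
    [0.95, 0.05],
])

# Default behaviour: predict class 1 when P(class 1) >= 0.5
default_pred = (proba[:, 1] >= 0.5).astype(int)

# A lower threshold trades precision for recall -- more samples
# get flagged as positive
low_threshold_pred = (proba[:, 1] >= 0.3).astype(int)

print(default_pred)        # [0 0 1 1 0]
print(low_threshold_pred)  # [0 1 1 1 0]
```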
Key Parameters to Know
The LogisticRegression class has several parameters that control its behavior:
- C (default=1.0): Inverse of regularization strength. Smaller values mean stronger regularization, which helps prevent overfitting. Larger values let the model fit the training data more closely.
- solver (default='lbfgs'): The optimization algorithm. Options include lbfgs, liblinear, newton-cg, newton-cholesky, sag, and saga. For small datasets, the default lbfgs works well. For very large datasets, sag or saga may be faster.
- max_iter (default=100): Maximum number of iterations for the solver to converge. Increase this if you get a convergence warning.
- class_weight (default=None): Set to 'balanced' when dealing with imbalanced datasets. This automatically adjusts weights inversely proportional to class frequencies.
The default regularization is L2. If you need L1 regularization (useful for feature selection, since it drives some coefficients to exactly zero), use solver='saga' or solver='liblinear' with penalty='l1'. Also note that the multi_class parameter was deprecated in scikit-learn 1.5 and removed in 1.7.
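Since the best value of C is data-dependent, cross-validation is the standard way to choose it. The sketch below uses GridSearchCV with a pipeline so scaling is refit inside each fold (avoiding data leakage); the candidate C values are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Pipeline so StandardScaler is refit on each CV training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Candidate C values spanning several orders of magnitude
grid = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_)
```

Scikit-learn also offers LogisticRegressionCV, which performs the same search over C more efficiently by reusing solver state across values.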
Evaluating the Model
Accuracy alone does not tell the full story of a classification model's performance. Scikit-learn provides a comprehensive set of evaluation tools that give a more complete picture.
Confusion Matrix
A confusion matrix shows the counts of true positives, true negatives, false positives, and false negatives. It is the foundation for calculating precision, recall, and F1-score.
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
# Classification Report (precision, recall, F1-score)
report = classification_report(y_test, y_pred)
print(f"\nClassification Report:\n{report}")
ROC Curve and AUC Score
The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate at various threshold settings. The AUC (Area Under the Curve) score summarizes this curve as a single number between 0 and 1. A score of 1.0 indicates a perfect classifier, while 0.5 indicates a model that performs no better than random guessing.
from sklearn.metrics import roc_auc_score, roc_curve
# Calculate AUC score using probabilities of the positive class
y_proba_positive = model.predict_proba(X_test_scaled)[:, 1]
auc_score = roc_auc_score(y_test, y_proba_positive)
print(f"AUC Score: {auc_score:.4f}")
# Get ROC curve data points
fpr, tpr, thresholds = roc_curve(y_test, y_proba_positive)
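The threshold arrays returned by roc_curve can also be used to choose an operating point. One common (though not the only) heuristic is Youden's J statistic, which picks the threshold maximizing TPR minus FPR. The scores below are made up so the classes happen to be perfectly separable.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: the threshold maximizing TPR - FPR is a common
# heuristic for picking an operating point on the ROC curve
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]
print(best_threshold)
```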
Complete Working Example
The following example puts everything together using the Breast Cancer Wisconsin dataset, which is bundled with scikit-learn. This dataset contains 569 samples with 30 features each, and the task is to classify tumors as malignant (0) or benign (1).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
)
# --------------------------------------------------
# 1. Load the dataset
# --------------------------------------------------
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names
print(f"Dataset shape: {X.shape}")
print(f"Classes: {target_names}")
print(f"Class distribution: {np.bincount(y)}")
# --------------------------------------------------
# 2. Split into training and testing sets
# --------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# --------------------------------------------------
# 3. Scale features
# --------------------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# --------------------------------------------------
# 4. Train Logistic Regression
# --------------------------------------------------
model = LogisticRegression(
    random_state=42,
    max_iter=200,
    C=1.0,
    solver="lbfgs",
)
model.fit(X_train_scaled, y_train)
# --------------------------------------------------
# 5. Make predictions
# --------------------------------------------------
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
# --------------------------------------------------
# 6. Evaluate the model
# --------------------------------------------------
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"AUC Score: {roc_auc_score(y_test, y_proba):.4f}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"\nClassification Report:\n{classification_report(y_test, y_pred, target_names=target_names)}")
# --------------------------------------------------
# 7. Inspect learned coefficients
# --------------------------------------------------
print("\nTop 5 features by coefficient magnitude:")
coef_abs = np.abs(model.coef_[0])
top_indices = coef_abs.argsort()[-5:][::-1]
for idx in top_indices:
print(f" {feature_names[idx]}: {model.coef_[0][idx]:.4f}")
Running this code will produce output similar to the following:
# Sample output:
# Dataset shape: (569, 30)
# Classes: ['malignant' 'benign']
# Class distribution: [212 357]
#
# Accuracy: 0.9737
# AUC Score: 0.9974
#
# Confusion Matrix:
# [[41 2]
# [ 1 70]]
#
# Classification Report:
# precision recall f1-score support
# malignant 0.98 0.95 0.97 43
# benign 0.97 0.99 0.98 71
# accuracy 0.97 114
# macro avg 0.97 0.97 0.97 114
# weighted avg 0.97 0.97 0.97 114
The model achieves roughly 97% accuracy and a near-perfect AUC score on this dataset, demonstrating that Logistic Regression can be highly effective for well-structured binary classification tasks.
Key Takeaways
- Logistic Regression is a classification algorithm: Despite its name, it predicts discrete class labels by mapping linear outputs through the sigmoid function to produce probabilities between 0 and 1.
- Feature scaling matters: Because the algorithm relies on gradient-based optimization, standardizing features with StandardScaler helps the model converge faster and produce more reliable coefficients.
- Use multiple evaluation metrics: Accuracy alone can be misleading, especially with imbalanced datasets. Always check precision, recall, F1-score, and the AUC score for a complete picture of model performance.
- Regularization prevents overfitting: The C parameter controls how strongly the model is regularized. Smaller values of C apply more regularization. Use cross-validation to find the optimal value.
- Interpretability is a strength: Unlike black-box models, Logistic Regression coefficients directly indicate how each feature influences the prediction, making it a strong choice when explainability is required.
Logistic Regression is a solid starting point for any binary classification problem. It trains quickly, produces interpretable results, and often serves as a baseline against which more complex models are measured. Master this algorithm first, and you will have a reliable tool in your machine learning toolkit that applies to a wide range of real-world scenarios.
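The "baseline" role mentioned above can be made concrete with scikit-learn's DummyClassifier, which here always predicts the majority class. Any model worth keeping should clearly beat it; this sketch reuses the same dataset and split as the complete example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Majority-class baseline: predicts the most frequent label for everything
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr_s, y_tr)
model = LogisticRegression(max_iter=200).fit(X_tr_s, y_tr)

print(f"Baseline accuracy: {baseline.score(X_te_s, y_te):.4f}")
print(f"LogReg accuracy:   {model.score(X_te_s, y_te):.4f}")
```

The gap between the two scores is a quick sanity check that the model has actually learned something beyond the class distribution.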