Linear Discriminant Analysis (LDA) is a supervised machine learning technique that serves a dual purpose: it classifies data into predefined groups and reduces the number of features in a dataset while preserving the information that separates those groups. This article walks through LDA from concept to working Python code using scikit-learn.
Whether you are building a classifier for medical diagnosis, filtering spam, or reducing a high-dimensional feature set before feeding it into another model, LDA is a technique worth having in your toolkit. It is computationally efficient, works well with small datasets, and provides interpretable results. In scikit-learn (version 1.8.0 as of this writing), LDA is implemented through the LinearDiscriminantAnalysis class in the sklearn.discriminant_analysis module.
What Is Linear Discriminant Analysis?
Linear Discriminant Analysis works by finding a linear combination of features that best separates two or more classes. It does this by maximizing the ratio of between-class variance to within-class variance. In plain terms, LDA looks for the projection that pushes different classes as far apart as possible while keeping data points within each class tightly clustered.
The algorithm follows a generative model framework. It models the distribution of input features for each class and then uses Bayes' theorem to assign new data points to the class with the highest posterior probability. LDA assumes that each class shares the same covariance matrix and that the features follow a multivariate Gaussian distribution.
LDA produces at most n_classes - 1 discriminant components. For a binary classification problem, that means a single component. For a dataset with four classes, you can extract up to three components.
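A quick check on a binary dataset (here the breast cancer dataset bundled with scikit-learn) confirms the component limit:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 2 classes -> at most 2 - 1 = 1 discriminant component
X, y = load_breast_cancer(return_X_y=True)
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print(X_lda.shape)  # (569, 1)
```

Requesting more components than n_classes - 1 raises an error, so plan downstream steps around this cap.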
The core math behind LDA involves computing two matrices: the within-class scatter matrix (S_W) and the between-class scatter matrix (S_B). LDA then solves the generalized eigenvalue problem for S_W^-1 S_B to find the directions (eigenvectors) that maximize class separation. The eigenvectors with the largest eigenvalues correspond to the discriminant axes that carry the strongest class-separating power.
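The scatter-matrix computation can be sketched directly with NumPy. This is a minimal illustration on the Iris dataset, not scikit-learn's actual implementation (which relies on more numerically stable decompositions):

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
overall_mean = X.mean(axis=0)
n_features = X.shape[1]

S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    # Within-class scatter: deviations of samples from their class mean
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    # Between-class scatter: class-mean deviations from the overall mean
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * diff @ diff.T

# Solve the generalized eigenvalue problem for S_W^-1 S_B
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
print(eigvals.real[order])
```

With 3 classes, only the first two eigenvalues are meaningfully nonzero, matching the n_classes - 1 component limit.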
How LDA Differs from PCA
Both LDA and PCA (Principal Component Analysis) are linear transformation techniques used for dimensionality reduction, but they solve fundamentally different problems. PCA is unsupervised. It finds directions of maximum variance in the data without considering class labels. LDA is supervised. It finds directions that maximize the separation between known classes.
This distinction matters in practice. PCA works well when the goal is general-purpose compression of a feature set. LDA works better when the goal is classification, because it directly optimizes for the boundary between classes. If two classes overlap heavily along the axis of maximum variance but separate cleanly along a different axis, PCA will miss the useful direction while LDA will find it.
Use PCA when you need unsupervised compression without class labels. Use LDA when you have labeled data and your end goal is classification or class-aware dimensionality reduction.
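To see the difference concretely, here is a small sketch on synthetic data (hypothetical, chosen so that the axis of maximum variance carries no class information while a low-variance axis separates the classes):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 500
# Both classes spread widely along x; only y separates them
X0 = np.column_stack([rng.normal(0, 10, n), rng.normal(0, 1, n)])
X1 = np.column_stack([rng.normal(0, 10, n), rng.normal(5, 1, n)])
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# PCA picks the high-variance (noise) axis; LDA picks the separating axis
pca_axis = PCA(n_components=1).fit(X).components_[0]
lda_axis = LinearDiscriminantAnalysis(n_components=1).fit(X, y).scalings_[:, 0]

print("PCA direction:", pca_axis)  # dominated by the x component
print("LDA direction:", lda_axis)  # dominated by the y component
```

PCA's first component aligns with the wide but uninformative x-axis, while LDA's discriminant direction aligns with the y-axis that actually separates the classes.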
LDA for Classification
The following example demonstrates using LDA as a standalone classifier on the Iris dataset. The Iris dataset contains 150 samples across three species, each described by four features: sepal length, sepal width, petal length, and petal width.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Create and train the LDA classifier
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
# Make predictions
y_pred = lda.predict(X_test)
# Evaluate performance
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
In this example, LinearDiscriminantAnalysis() is instantiated with its default settings, which use the svd (Singular Value Decomposition) solver. The model learns the class means and the shared covariance structure from the training data during the fit() call, then applies Bayes' theorem to classify each test sample during predict().
The output will show high accuracy on the Iris dataset, typically above 97%, because the classes are well-separated in the original feature space.
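Because LDA follows a generative model, it also exposes class posterior probabilities through predict_proba; each row is the Bayes posterior over the three species:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Posterior probabilities P(class | x) for the first three test samples
proba = lda.predict_proba(X_test[:3])
print(proba.round(3))  # each row sums to 1
```

These posteriors are useful when you need confidence estimates or want to apply a custom decision threshold rather than the default argmax.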
LDA for Dimensionality Reduction
Beyond classification, LDA is commonly used as a preprocessing step to reduce the number of features before feeding data into another classifier. This is especially valuable when working with high-dimensional datasets where training time and overfitting are concerns.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the Wine dataset (13 features, 3 classes)
wine = load_wine()
X = wine.data
y = wine.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Apply LDA for dimensionality reduction (3 classes -> max 2 components)
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train_scaled, y_train)
X_test_lda = lda.transform(X_test_scaled)
print(f"Original feature count: {X_train_scaled.shape[1]}")
print(f"Reduced feature count: {X_train_lda.shape[1]}")
# Train a logistic regression classifier on the reduced features
clf = LogisticRegression(random_state=42)
clf.fit(X_train_lda, y_train)
y_pred = clf.predict(X_test_lda)
print(f"\nAccuracy with LDA reduction: {accuracy_score(y_test, y_pred):.4f}")
# Compare against logistic regression on the full feature set
clf_full = LogisticRegression(random_state=42, max_iter=1000)
clf_full.fit(X_train_scaled, y_train)
y_pred_full = clf_full.predict(X_test_scaled)
print(f"Accuracy without LDA: {accuracy_score(y_test, y_pred_full):.4f}")
The Wine dataset has 13 features and 3 classes. LDA reduces those 13 features down to just 2 discriminant components (the maximum for 3 classes). Despite this dramatic reduction from 13 dimensions to 2, the accuracy often remains comparable to or even exceeds the full-feature classifier, because LDA preserves exactly the information that matters for separating the classes.
Feature scaling is not strictly required for LDA to work, but it is recommended when features have very different scales. Standardizing the features to zero mean and unit variance helps the covariance estimation step produce more stable results.
Examining the Explained Variance Ratio
After fitting LDA, you can inspect how much of the discriminative information each component captures through the explained_variance_ratio_ attribute.
# After fitting LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train_scaled, y_train)
# Check how much discriminative power each component holds
for i, ratio in enumerate(lda.explained_variance_ratio_):
    print(f"Component {i + 1}: {ratio:.4f} ({ratio * 100:.1f}%)")
print(f"Total: {sum(lda.explained_variance_ratio_):.4f}")
In many datasets, the first component captures the overwhelming majority of the class-separating information. This tells you whether reducing to a single component might be sufficient for your task.
Tuning LDA Parameters
The LinearDiscriminantAnalysis class in scikit-learn exposes several parameters that control how the model computes and applies the discriminant transformation.
Solver
The solver parameter determines the algorithm used to compute the discriminant directions. Three options are available:
- svd (default) — Uses Singular Value Decomposition. Does not require computing the covariance matrix, making it suitable for datasets with a large number of features. Does not support shrinkage.
- lsqr — Uses a least-squares solution. Efficient and supports shrinkage. A good choice when you need regularization.
- eigen — Uses eigenvalue decomposition of the covariance matrix. Supports shrinkage and also computes the covariance matrix, which can be stored for inspection.
# Using the lsqr solver with automatic shrinkage
lda_lsqr = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')
lda_lsqr.fit(X_train, y_train)
# Using the eigen solver
lda_eigen = LinearDiscriminantAnalysis(solver='eigen', shrinkage='auto')
lda_eigen.fit(X_train, y_train)
Prior Probabilities
By default, LDA estimates the prior probability of each class from the training data (i.e., the proportion of samples in each class). You can override this with the priors parameter if you know the true class distribution differs from your training set.
# Set equal priors for all three classes
lda_equal = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3])
lda_equal.fit(X_train, y_train)
LDA with Shrinkage
When you have a small number of training samples relative to the number of features, the estimated covariance matrix can be unreliable. Shrinkage addresses this by pulling the covariance estimate toward a structured target (typically a diagonal matrix), which stabilizes the model.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
# Generate a dataset where shrinkage helps:
# many features, few samples
X, y = make_classification(
    n_samples=100,
    n_features=50,
    n_informative=10,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=42
)
# Standard LDA (no shrinkage)
lda_standard = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None)
scores_standard = cross_val_score(lda_standard, X, y, cv=5)
print(f"No shrinkage: {scores_standard.mean():.4f} (+/- {scores_standard.std():.4f})")
# LDA with automatic shrinkage (Ledoit-Wolf)
lda_shrunk = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')
scores_shrunk = cross_val_score(lda_shrunk, X, y, cv=5)
print(f"Auto shrinkage: {scores_shrunk.mean():.4f} (+/- {scores_shrunk.std():.4f})")
# LDA with manual shrinkage value
lda_manual = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=0.5)
scores_manual = cross_val_score(lda_manual, X, y, cv=5)
print(f"Shrinkage=0.5: {scores_manual.mean():.4f} (+/- {scores_manual.std():.4f})")
The shrinkage='auto' setting uses the Ledoit-Wolf lemma to automatically determine the optimal amount of regularization. This is the recommended approach in many cases. You can also pass a float between 0 and 1, where 0 means no shrinkage (the full empirical covariance) and 1 means maximum shrinkage (using only the diagonal of the covariance matrix).
Shrinkage is only available with the lsqr and eigen solvers. Attempting to use shrinkage with the default svd solver will raise an error.
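You can verify this guard yourself. The exact exception type is a scikit-learn implementation detail, so this sketch catches broadly rather than assuming one:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# shrinkage is incompatible with the default svd solver
lda_bad = LinearDiscriminantAnalysis(solver='svd', shrinkage='auto')
try:
    lda_bad.fit(X, y)
except Exception as exc:
    print(f"{type(exc).__name__}: {exc}")
```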
Using a Custom Covariance Estimator
For even more control over how the covariance matrix is estimated, scikit-learn allows you to pass a custom covariance estimator through the covariance_estimator parameter. This accepts any estimator with a fit method and a covariance_ attribute, such as those in sklearn.covariance.
from sklearn.covariance import OAS
# Use the Oracle Approximating Shrinkage (OAS) estimator
lda_oas = LinearDiscriminantAnalysis(
    solver='lsqr',
    covariance_estimator=OAS()
)
lda_oas.fit(X_train, y_train)
y_pred = lda_oas.predict(X_test)
print(f"Accuracy with OAS covariance: {accuracy_score(y_test, y_pred):.4f}")
Key Takeaways
- Dual purpose: LDA functions as both a classifier and a dimensionality reduction technique. Use it standalone for classification or as a preprocessing step before another algorithm.
- Supervised by design: Unlike PCA, LDA uses class labels to find the projections that best separate your data. This makes it a stronger choice when class separation is the goal.
- Component limit: LDA produces at most n_classes - 1 components. Plan your pipeline accordingly when working with binary or low-class-count problems.
- Shrinkage for stability: When your sample count is small relative to your feature count, use shrinkage='auto' with the lsqr or eigen solver to regularize the covariance estimate.
- Solver selection: Use svd for large feature sets without shrinkage. Use lsqr for an efficient solver with shrinkage support. Use eigen when you need the covariance matrix stored for analysis.
Linear Discriminant Analysis remains a reliable and interpretable tool for classification and dimensionality reduction. Its assumptions (shared covariance, Gaussian distributions) are strict on paper but often flexible enough in practice. For problems where classes have meaningfully different means, LDA frequently matches or outperforms more complex models while training in a fraction of the time.