Dimensionality Reduction Models in Python

When your dataset has hundreds or thousands of features, training models becomes slow, visualizing patterns becomes impossible, and overfitting becomes far more likely. Dimensionality reduction mitigates all three problems by compressing your data into fewer features while preserving the information that actually matters. This guide walks through the major techniques available in Python and shows you how to implement each one with working code.

Imagine trying to understand a spreadsheet with 500 columns. You cannot plot it. You cannot visually inspect it. And any model you train on it will likely memorize noise rather than learn real patterns. This is the curse of dimensionality -- as the number of features grows, data points become increasingly sparse, distances between samples lose their meaning, and algorithms struggle to generalize. Dimensionality reduction techniques address this by transforming your high-dimensional data into a compact representation, keeping the structure and relationships that matter while discarding redundancy and noise.
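
The loss of meaningful distance is easy to demonstrate. This sketch (illustrative only, using uniform random data) measures how far the nearest and farthest neighbors of one point are as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimension grows, the nearest and farthest neighbors of a point
# become almost equally far away: the relative contrast collapses
for d in [2, 10, 100, 1000]:
    points = rng.random((200, d))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  (max-min)/min distance = {contrast:.2f}")
```

In low dimensions the farthest neighbor is many times farther than the nearest; by a thousand dimensions the two are nearly indistinguishable, which is exactly what undermines distance-based algorithms.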

There are two broad families of techniques. Linear methods like PCA, SVD, and LDA assume that the meaningful structure in your data can be captured through linear combinations of the original features. Non-linear methods like t-SNE and UMAP make no such assumption and can preserve complex, curved relationships that linear methods would miss entirely. Both families have their place, and understanding when to reach for each one is a core skill in practical machine learning.

What Is Dimensionality Reduction?

Dimensionality reduction is the process of taking a dataset with many features and producing a new dataset with fewer features, where each new feature is derived from the originals. The goal is to retain as much useful information as possible while reducing the total number of variables. This can mean preserving the overall variance in the data, maintaining the distances between data points, or maximizing the separability between known classes.

There are two distinct approaches to accomplishing this. Feature selection picks a subset of the original features and discards the rest. Feature extraction creates entirely new features by combining the originals through mathematical transformations. The techniques covered in this article all fall into the feature extraction category -- they generate new composite features that did not exist in the original dataset.
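
The distinction is easy to see in code. A brief sketch contrasting the two approaches on scikit-learn's bundled wine data (SelectKBest stands in for feature selection here; it is not covered further in this article):

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_wine(return_X_y=True)  # 178 samples, 13 features

# Feature selection: keep 5 of the original 13 measurements, unchanged
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Feature extraction: build 5 brand-new composite features from all 13
X_ext = PCA(n_components=5).fit_transform(X)

print(X_sel.shape, X_ext.shape)
```

Both results have five columns, but the selected columns are original measurements you can still name, while the extracted ones are mathematical combinations with no direct physical meaning.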

Note

The new features produced by dimensionality reduction are typically not human-interpretable. A principal component, for example, is a weighted sum of all original features. You gain computational efficiency and reduced overfitting, but you lose the ability to point at a single original feature and say "this is what matters."
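
You can inspect those weights directly. A small sketch on the wine dataset (any standardized dataset would do):

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Each row of components_ is one component: a weight for every
# original feature, so no single feature "is" the component
print(pca.components_.shape)  # (2, 13): 2 components x 13 features
print(pca.components_[0].round(2))
```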

Before applying any reduction technique, you should standardize your features. Techniques like PCA are sensitive to the scale of input variables -- a feature measured in thousands will dominate one measured in decimals, regardless of actual importance. Use StandardScaler from scikit-learn to transform each feature to zero mean and unit variance before reduction.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Linear Techniques: PCA, SVD, and LDA

Linear dimensionality reduction methods find new axes in your data by computing linear combinations of the original features. They work well when the relationships between variables are approximately linear, and they tend to be fast and mathematically well-understood.

Principal Component Analysis (PCA)

PCA is the workhorse of dimensionality reduction. It finds the directions in the data that capture the largest amount of variance and projects the data onto those directions. The first principal component captures the single direction of greatest variance. The second captures the next greatest variance while being orthogonal to the first, and so on. By keeping only the top k components, you reduce the feature space to k dimensions while preserving as much total variance as possible.

Under the hood, PCA performs a Singular Value Decomposition of the centered data matrix. The resulting components are uncorrelated with each other, which can be an advantage for downstream models that assume feature independence.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load a high-dimensional dataset
X, y = load_digits(return_X_y=True)
# 64 features (8x8 pixel images)

# Standardize before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Original shape: {X.shape}")
print(f"Reduced shape:  {X_pca.shape}")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")
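
That SVD relationship can be verified directly with NumPy. A sketch (the svd_solver="full" setting forces the exact LAPACK decomposition; projections match up to per-component sign flips, which are arbitrary):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
Xc = X - X.mean(axis=0)  # PCA centers the data internally

# Manual route: SVD of the centered matrix, then project onto the
# top two right singular vectors
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
manual = Xc @ Vt[:2].T

auto = PCA(n_components=2, svd_solver="full").fit_transform(Xc)

# Agree element-wise once arbitrary sign flips are ignored
print(np.allclose(np.abs(manual), np.abs(auto)))
```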

The explained_variance_ratio_ attribute tells you how much of the total variance each component captures. You can use this to decide how many components to keep. A common approach is to plot the cumulative explained variance and choose the number of components where the curve starts to flatten.
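
Here is a sketch of that cumulative check without plotting (the 90% threshold is an arbitrary example):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)  # keep every component
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 90%
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"{k} components retain {cumulative[k - 1]:.1%} of the variance")
```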

# Find the number of components for 95% variance
pca_full = PCA(n_components=0.95)
X_reduced = pca_full.fit_transform(X_scaled)

print(f"Components needed for 95% variance: {pca_full.n_components_}")
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")

Pro Tip

When you pass a float between 0 and 1 to n_components, PCA automatically selects the minimum number of components needed to retain that fraction of the total variance. This is often more practical than guessing a fixed number of components.

Scikit-learn also provides KernelPCA for situations where the data has non-linear relationships. By applying a kernel function (such as RBF, polynomial, or sigmoid), KernelPCA can capture non-linear structure that standard PCA would miss.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.01)
X_kpca = kpca.fit_transform(X_scaled)

print(f"Kernel PCA shape: {X_kpca.shape}")

Truncated Singular Value Decomposition (SVD)

Truncated SVD is closely related to PCA but has one key advantage: it works directly on sparse matrices without needing to center the data first. This makes it the preferred choice for text data represented as TF-IDF matrices, where the data is naturally sparse and centering would destroy the sparsity.

In natural language processing, Truncated SVD applied to term-document matrices is known as Latent Semantic Analysis (LSA). It uncovers latent topics by finding the directions of greatest variance in the term-document space.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Example with text data
documents = [
    "Python is a great programming language",
    "Machine learning uses Python extensively",
    "Data science requires statistical knowledge",
    "Deep learning models need large datasets",
    "Natural language processing analyzes text",
]

# Create sparse TF-IDF matrix
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(documents)

print(f"TF-IDF shape (sparse): {X_tfidf.shape}")

# Reduce dimensions with Truncated SVD
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X_tfidf)

print(f"Reduced shape: {X_svd.shape}")
print(f"Variance explained: {sum(svd.explained_variance_ratio_):.2%}")

Note

Use PCA for dense numerical data and Truncated SVD for sparse data. Applying PCA to a sparse matrix will first convert it to a dense array, which can consume enormous amounts of memory and defeat the purpose of having a sparse representation.

Linear Discriminant Analysis (LDA)

Unlike PCA and SVD, which are unsupervised, Linear Discriminant Analysis is a supervised dimensionality reduction technique. It uses class labels to find the projection that maximizes the separation between classes while minimizing the spread within each class. This makes LDA particularly useful as a preprocessing step before classification.

LDA computes two scatter matrices: the within-class scatter matrix (how spread out the samples are within each class) and the between-class scatter matrix (how far apart the class centroids are). It then finds the directions that maximize the ratio of between-class variance to within-class variance.

One constraint to be aware of: LDA can produce at most c - 1 components, where c is the number of classes. If you have a binary classification problem, LDA can only reduce to a single dimension.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_wine

# Load dataset with 3 classes
X, y = load_wine(return_X_y=True)

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# LDA: reduce to 2 components (max is n_classes - 1 = 2)
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

print(f"Original shape: {X.shape}")
print(f"Reduced shape:  {X_lda.shape}")
print(f"Variance explained: {sum(lda.explained_variance_ratio_):.2%}")
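
For the curious, the scatter-matrix construction described above can be done by hand. This sketch rebuilds both matrices on the same wine data and confirms the c - 1 rule (3 classes, so 2 useful directions):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
overall_mean = X.mean(axis=0)

n_features = X.shape[1]
S_w = np.zeros((n_features, n_features))  # within-class scatter
S_b = np.zeros((n_features, n_features))  # between-class scatter

for c in np.unique(y):
    Xc = X[y == c]
    mean_c = Xc.mean(axis=0)
    S_w += (Xc - mean_c).T @ (Xc - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_b += len(Xc) * (diff @ diff.T)

# LDA directions are eigenvectors of inv(S_w) @ S_b; only c - 1 of the
# eigenvalues are nonzero, so only c - 1 directions carry information
eigvals = np.linalg.eigvals(np.linalg.inv(S_w) @ S_b)
print(int(np.sum(eigvals.real > 1e-8)))  # two nonzero for three classes
```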

Non-Linear Techniques: t-SNE and UMAP

Linear methods assume that the important structure in your data lies along flat planes and straight lines. Many real-world datasets have structure that curves, folds, and twists through high-dimensional space. Non-linear techniques can capture these complex relationships, making them particularly valuable for visualization and exploratory data analysis.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE converts the pairwise distances between data points into probabilities, then finds a low-dimensional arrangement of points that preserves those probability distributions as closely as possible. It excels at revealing clusters and local structure in the data, which is why it has become a standard tool for visualizing high-dimensional datasets.

The algorithm works in two stages. First, it computes pairwise similarities between all points in the high-dimensional space using Gaussian distributions. Second, it initializes a random low-dimensional embedding and iteratively adjusts point positions to minimize the difference between the high-dimensional and low-dimensional similarity distributions, using a Student's t-distribution in the lower space to allow for better separation of clusters.

from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import numpy as np

# Load data
X, y = load_digits(return_X_y=True)

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate="auto",
    init="pca",
    random_state=42,
    max_iter=1000  # called n_iter before scikit-learn 1.5
)
X_tsne = tsne.fit_transform(X_scaled)

print(f"Original shape: {X.shape}")
print(f"t-SNE shape:    {X_tsne.shape}")
print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")

Warning

t-SNE is designed for visualization, not as a general-purpose preprocessing step. It does not produce a reusable transformation -- you cannot apply a fitted t-SNE model to new data points. The distances and cluster sizes in a t-SNE plot are not directly interpretable. Do not draw conclusions about the relative density or distance between clusters based solely on a t-SNE visualization.

The perplexity parameter controls how t-SNE balances attention between local and global structure. Lower perplexity values (5-10) focus on very local neighborhoods, while higher values (30-50) consider broader relationships. Running t-SNE with several perplexity values and comparing the results is good practice.

# Compare different perplexity values
for perp in [5, 15, 30, 50]:
    tsne = TSNE(
        n_components=2,
        perplexity=perp,
        learning_rate="auto",
        init="pca",
        random_state=42
    )
    X_embedded = tsne.fit_transform(X_scaled)
    print(f"Perplexity {perp:2d} -> KL divergence: {tsne.kl_divergence_:.4f}")

Uniform Manifold Approximation and Projection (UMAP)

UMAP has become one of the go-to alternatives to t-SNE for non-linear dimensionality reduction. It is grounded in Riemannian geometry and topological data analysis, and operates on the assumption that the data lies on a locally connected manifold that is approximately uniformly distributed. In practice, UMAP tends to preserve both local cluster structure and broader global relationships better than t-SNE, and it runs significantly faster on large datasets.

The algorithm works by first constructing a weighted graph of nearest neighbors in the high-dimensional space, then optimizing a low-dimensional layout so that the graph structure is preserved as faithfully as possible. Two key hyperparameters control the output: n_neighbors determines the size of the local neighborhood considered (larger values preserve more global structure), and min_dist controls how tightly points can be packed in the embedding (lower values produce tighter clusters).

import umap
from sklearn.datasets import load_digits

# Load data
X, y = load_digits(return_X_y=True)

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply UMAP
reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    metric="euclidean",
    random_state=42
)
X_umap = reducer.fit_transform(X_scaled)

print(f"Original shape: {X.shape}")
print(f"UMAP shape:     {X_umap.shape}")

Unlike t-SNE, UMAP supports a transform method, meaning you can fit a UMAP model on training data and then project new, unseen data points into the same embedding space. This makes UMAP viable as a preprocessing step in production pipelines, not just for one-off visualizations.

from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Fit on training data, transform test data
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_umap = reducer.fit_transform(X_train)
X_test_umap = reducer.transform(X_test)

print(f"Train embedding: {X_train_umap.shape}")
print(f"Test embedding:  {X_test_umap.shape}")

Pro Tip

UMAP also supports supervised and semi-supervised dimensionality reduction. Pass your labels as the y parameter in the fit method, and UMAP will use class information to guide the embedding, often producing significantly better cluster separation for classification tasks.

UMAP is not included in scikit-learn. Install it separately with pip install umap-learn. Import it as import umap -- note that the package name on PyPI is umap-learn, not umap.

Choosing the Right Technique

There is no single best dimensionality reduction technique. The right choice depends on your data, your goal, and your computational constraints. Here is a practical decision framework.

Use PCA when your data is dense and numerical, you need a fast and deterministic reduction, you want to preserve global variance, or you need a reusable transformation for production pipelines. PCA is the default starting point for dimensionality reduction.

Use Truncated SVD when your data is sparse, such as TF-IDF matrices from text processing or one-hot encoded categorical features. It avoids the memory explosion that would occur if you converted a sparse matrix to dense before applying PCA.

Use LDA when you have labeled data and your primary goal is classification. LDA is the only technique here that is supervised, and it directly optimizes for class separation rather than general variance preservation.

Use t-SNE when you need high-quality 2D or 3D visualizations of cluster structure and you are working with a reasonably sized dataset (up to tens of thousands of samples). Be prepared to experiment with the perplexity parameter.

Use UMAP when you need non-linear reduction at scale, you want to embed new data after fitting, or you need to preserve both local and global structure. UMAP is generally faster than t-SNE and works well for both visualization and as a preprocessing step in modeling pipelines.

Note

A common and effective strategy is to chain techniques together. For example, apply PCA first to reduce from 1000 features down to 50, then apply UMAP or t-SNE to reduce from 50 to 2 for visualization. This two-stage approach combines the speed of PCA with the non-linear expressiveness of manifold methods.

Building a Reduction Pipeline

In practice, you rarely apply dimensionality reduction in isolation. It is part of a larger workflow that includes scaling, reduction, and modeling. Scikit-learn's Pipeline class lets you chain these steps together so that the entire sequence can be fitted, transformed, and cross-validated as a single unit.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits

# Load data
X, y = load_digits(return_X_y=True)

# Build pipeline: Scale -> Reduce -> Classify
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("classifier", LogisticRegression(max_iter=5000, random_state=42))
])

# Cross-validate the entire pipeline
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print(f"Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

You can use GridSearchCV to search over different numbers of components alongside your model hyperparameters, letting the cross-validation process determine the optimal level of dimensionality reduction for your specific task.

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    "pca__n_components": [10, 20, 30, 40, 50],
    "classifier__C": [0.1, 1.0, 10.0]
}

# Search for best combination
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid_search.fit(X, y)

print(f"Best params: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.4f}")

For UMAP in a pipeline, you can drop the UMAP object in directly, since it follows the scikit-learn transformer API with fit, transform, and fit_transform methods.

import umap
from sklearn.ensemble import RandomForestClassifier

# UMAP-based pipeline
umap_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("umap", umap.UMAP(n_components=10, random_state=42)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

scores = cross_val_score(umap_pipeline, X, y, cv=5, scoring="accuracy")
print(f"UMAP Pipeline Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

Key Takeaways

  1. Always standardize first: Scale your features before applying any reduction technique. Methods like PCA and LDA are sensitive to feature magnitudes, and unscaled data will produce misleading results.
  2. Start with PCA: It is fast, deterministic, and works well as a baseline. Use the explained variance ratio to determine how many components to retain, and consider passing a float to n_components for automatic selection.
  3. Match the technique to the data type: Use Truncated SVD for sparse matrices, LDA when you have class labels and want separation, t-SNE for visualization of moderate datasets, and UMAP when you need non-linear reduction at scale or the ability to transform new data.
  4. Chain techniques when appropriate: A PCA reduction to 50 components followed by UMAP to 2 dimensions is often more effective than applying either technique alone on very high-dimensional data.
  5. Use pipelines for reproducibility: Wrap your scaling, reduction, and modeling steps in a scikit-learn Pipeline. This prevents data leakage during cross-validation and makes your workflow clean and reproducible.

Dimensionality reduction is not just a preprocessing convenience -- it is a fundamental tool for understanding and working with high-dimensional data. By reducing feature counts, you speed up training, reduce overfitting, enable visualization, and often improve model accuracy by removing noise. The techniques covered here, from the straightforward reliability of PCA to the non-linear expressiveness of UMAP, give you a complete toolkit for handling datasets of any size and complexity.
