Real-world datasets often contain hundreds or thousands of features. While more data can mean richer insights, it also introduces sparsity, noise, and crushing computational overhead. Dimensionality reduction transforms high-dimensional data into a compact, lower-dimensional representation that preserves the patterns that matter -- and discards the redundancy that does not.
This article walks through four of the leading dimensionality reduction techniques available in Python's scientific computing ecosystem: PCA, LDA, t-SNE, and UMAP. Each serves a different purpose, operates under different mathematical assumptions, and shines in different scenarios. By the end, you will know how to implement all four using scikit-learn and the umap-learn library, and -- just as important -- you will know when to reach for each one.
Why Reduce Dimensions
Every feature in a dataset is a dimension. A grayscale image scaled to 32 x 32 pixels has 1,024 dimensions; an RGB version has 3,072. A genomics dataset can easily reach tens of thousands. As dimensions grow, data points become increasingly sparse in the feature space, a phenomenon known as the curse of dimensionality. Distance-based algorithms like k-nearest neighbors begin to struggle because, in very high dimensions, the difference between the nearest and farthest neighbor shrinks toward zero. Separation between classes dissolves.
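The distance-concentration effect is easy to demonstrate with a few lines of NumPy. This is an illustrative sketch (the point counts and dimension choices are arbitrary): it measures the relative gap between the farthest and nearest point from the center of a random cloud as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=500):
    """Relative gap between the farthest and nearest sample,
    measured from the center of the unit hypercube."""
    X = rng.random((n_points, n_dims))
    d = np.linalg.norm(X - 0.5, axis=1)  # distances to the center
    return (d.max() - d.min()) / d.min()

for n_dims in (2, 10, 100, 1000):
    print(f"{n_dims:>4} dims: contrast = {distance_contrast(n_dims):.2f}")
```

The contrast shrinks sharply as dimensions are added: in a thousand dimensions, the nearest and farthest neighbors sit at nearly the same distance, which is exactly why distance-based methods degrade.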
Dimensionality reduction addresses this by projecting data into a lower-dimensional subspace that captures the signal while shedding the noise. The practical benefits fall into four categories: faster training and inference times, reduced risk of overfitting, elimination of multicollinearity among correlated features, and the ability to visualize data that would otherwise exist in an unimaginable number of dimensions.
Dimensionality reduction broadly splits into two strategies. Feature selection keeps a subset of the original features (filter, wrapper, and embedded methods). Feature extraction creates entirely new features by transforming or combining the originals (PCA, t-SNE, UMAP, LDA). This article focuses on feature extraction.
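To make the distinction concrete, here is a minimal feature-selection counterpart for comparison (a filter method; the choice of SelectKBest with k=20 is an arbitrary illustration, not something the rest of this article depends on). Note that it keeps 20 of the original pixel features rather than constructing new ones:

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_digits(return_X_y=True)

# Filter-style feature selection: score each original feature against
# the labels with an ANOVA F-test and keep the 20 highest-scoring pixels
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)
print(f"Kept {X_selected.shape[1]} of {X.shape[1]} original features")
```

Feature extraction methods like PCA, by contrast, return columns that no longer correspond to any single original pixel.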
Before applying any technique, standard practice is to scale the data so that features with larger numeric ranges do not dominate. The StandardScaler from scikit-learn transforms each feature to zero mean and unit variance, which is what every example below assumes as a preprocessing step.
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
# Load the digits dataset: 1,797 samples, 64 features (8x8 images)
X, y = load_digits(return_X_y=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Classes: {len(set(y))} (digits 0-9)")
# Original shape: (1797, 64)
# Classes: 10 (digits 0-9)
Principal Component Analysis (PCA)
PCA is the workhorse of linear dimensionality reduction. It finds the directions -- called principal components -- along which the data varies the most, then projects every sample onto those directions. The first principal component captures the maximum variance, the second captures the maximum remaining variance orthogonal to the first, and so on. Mathematically, PCA performs an eigenvalue decomposition of the covariance matrix (or equivalently, a singular value decomposition of the centered data matrix).
Because the components are orthogonal, the resulting features are completely uncorrelated. This property alone makes PCA a powerful preprocessing step before algorithms that assume feature independence.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Reduce 64 dimensions to 2 for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance (2 components): "
f"{sum(pca.explained_variance_ratio_):.2%}")
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1],
c=y, cmap='tab10', s=12, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection of Digits Dataset')
plt.tight_layout()
plt.savefig('pca_digits.png', dpi=150)
plt.show()
Two components alone will typically capture somewhere around a quarter of the total variance in the digits dataset (the exact figure depends on preprocessing). That may not sound like much, but the projection already separates several digit clusters visibly. To decide how many components to keep for a modeling pipeline, plot the cumulative explained variance and look for the "elbow" where adding more components yields diminishing returns.
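Such an elbow plot takes only a few lines; this sketch reuses the scaled digits data (the output filename is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the variance ratios
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative) + 1), cumulative, marker='.')
plt.axhline(0.95, color='grey', linestyle='--', label='95% threshold')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.legend()
plt.tight_layout()
plt.savefig('pca_cumulative_variance.png', dpi=150)

print(f"Components for 95% variance: {np.argmax(cumulative >= 0.95) + 1}")
```

The curve rises steeply at first and then flattens; the knee of that curve is the elbow referred to above.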
# Find the number of components that capture 95% of variance
pca_full = PCA(n_components=0.95)
X_pca_95 = pca_full.fit_transform(X_scaled)
print(f"Components needed for 95% variance: "
f"{pca_full.n_components_}")
# The exact count depends on preprocessing (scaling changes the variance structure)
In recent releases of scikit-learn (1.5 and later), the default PCA solver automatically selects the fastest algorithm based on the shape of your data. For datasets with at most 1,000 features and at least 10 times as many samples as features, it uses the "covariance_eigh" solver, which runs an eigenvalue decomposition on the covariance matrix instead of a full SVD. For larger matrices where you request fewer than 80% of the possible components, it switches to "randomized" SVD for significant speed gains. You can also pass n_components='mle' to let Minka's maximum likelihood estimation choose the number of components automatically.
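Letting Minka's MLE pick the dimensionality is a one-line change; a short sketch on the scaled digits data (the selected count is data-dependent, so no specific number is promised here):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# 'mle' applies Minka's maximum likelihood estimator to the
# eigenvalue spectrum to choose the number of components
pca_mle = PCA(n_components='mle')
X_mle = pca_mle.fit_transform(X_scaled)
print(f"MLE-selected components: {pca_mle.n_components_}")
```

Note that MLE requires more samples than features, which the digits dataset (1,797 samples, 64 features) easily satisfies.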
PCA works best when the relationships between features are roughly linear. If your data lies on a curved manifold -- imagine data distributed on the surface of a Swiss roll -- PCA will fail to unwrap the structure. For nonlinear data, Kernel PCA applies a kernel trick (commonly RBF, polynomial, or sigmoid) to project data into a higher-dimensional space where it becomes linearly separable, then applies standard PCA in that transformed space.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.01)
X_kpca = kpca.fit_transform(X_scaled)
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_kpca[:, 0], X_kpca[:, 1],
c=y, cmap='tab10', s=12, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.xlabel('Kernel PC 1')
plt.ylabel('Kernel PC 2')
plt.title('Kernel PCA (RBF) Projection of Digits Dataset')
plt.tight_layout()
plt.savefig('kpca_digits.png', dpi=150)
plt.show()
Linear Discriminant Analysis (LDA)
Where PCA is unsupervised and ignores class labels entirely, LDA is a supervised technique. It finds the linear combinations of features that maximize the separation between known classes while minimizing the spread within each class. Formally, LDA maximizes the ratio of the between-class scatter matrix to the within-class scatter matrix.
A key constraint of LDA is that it can produce at most C - 1 components, where C is the number of classes. For the digits dataset with 10 classes, that means a maximum of 9 components. This makes LDA particularly well-suited as a preprocessing step before classification tasks where you already have labeled training data.
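The constraint is enforced at fit time: asking for 10 components with only 10 classes raises an error, as this short check shows.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# 10 classes => at most 9 discriminant components
try:
    LinearDiscriminantAnalysis(n_components=10).fit(X_scaled, y)
except ValueError as err:
    print(f"ValueError: {err}")
```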
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_lda[:, 0], X_lda[:, 1],
c=y, cmap='tab10', s=12, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title('LDA Projection of Digits Dataset')
plt.tight_layout()
plt.savefig('lda_digits.png', dpi=150)
plt.show()
Because LDA explicitly optimizes for class separation, it often produces cleaner clusters in two dimensions than PCA does -- especially when the classes are well-defined and the within-class distributions are approximately Gaussian with similar covariance structures. The trade-off is that LDA requires labeled data, so it cannot be used in unsupervised settings.
LDA assumes that the features within each class follow a Gaussian distribution with equal covariance across classes. When these assumptions hold, LDA is one of the most efficient dimensionality reduction techniques for classification pipelines. When they do not, consider Kernel PCA or manifold methods instead.
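To see the supervised advantage in practice, here is an illustrative head-to-head (the classifier choice and the 9-component budget are arbitrary; exact accuracies will vary): both reducers compress the digits to 9 dimensions, then a logistic regression classifies the reduced features.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)

results = {}
for name, reducer in [("PCA", PCA(n_components=9)),
                      ("LDA", LinearDiscriminantAnalysis(n_components=9))]:
    X_train_red = reducer.fit_transform(X_train, y_train)  # PCA ignores y
    X_test_red = reducer.transform(X_test)
    clf = LogisticRegression(max_iter=2000).fit(X_train_red, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test_red))

for name, acc in results.items():
    print(f"{name} (9 components): {acc:.2%}")
```

Fitting the reducer only on the training split and reusing it on the test split mirrors how dimensionality reduction should be deployed in any supervised pipeline.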
t-SNE for Visualization
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear technique designed specifically for visualizing high-dimensional data in two or three dimensions. It works by converting pairwise distances in the high-dimensional space into conditional probabilities (using a Gaussian distribution), then minimizing the Kullback-Leibler divergence between those probabilities and a similar set of probabilities in the low-dimensional space (using a heavy-tailed Student's t-distribution). The heavy tail in the low-dimensional distribution gives distant points more room to spread apart, which is what produces the tight, well-separated clusters t-SNE is known for.
The most important hyperparameter is perplexity, which roughly corresponds to the number of effective nearest neighbors. Values between 5 and 50 are typical, with 30 being the default in scikit-learn. Lower perplexity values emphasize very local structure; higher values incorporate more global context.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30,
learning_rate='auto', init='pca',
random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
c=y, cmap='tab10', s=12, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('t-SNE Projection of Digits Dataset')
plt.tight_layout()
plt.savefig('tsne_digits.png', dpi=150)
plt.show()
t-SNE is strictly a visualization tool. The relative sizes of clusters and the distances between them in a t-SNE plot are generally not meaningful -- do not interpret them as measures of similarity. t-SNE also lacks a transform method, so it cannot project new, unseen data points. If you need to transform new data, use PCA or UMAP instead.
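The API difference can be confirmed directly:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a reusable projection; t-SNE only embeds the data it
# was fit on, so it exposes no transform method at all
print(f"PCA has transform:   {hasattr(PCA(), 'transform')}")
print(f"t-SNE has transform: {hasattr(TSNE(), 'transform')}")
```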
A common strategy for large datasets is to first reduce dimensions with PCA (down to 50 components, for example), then apply t-SNE to the PCA-reduced data. This two-step pipeline dramatically speeds up t-SNE, which has an O(N^2) time complexity under the exact algorithm and O(N log N) under the Barnes-Hut approximation (the default in scikit-learn).
# Two-step approach: PCA then t-SNE
from sklearn.decomposition import PCA
# Step 1: Reduce to 50 dimensions with PCA
pca_50 = PCA(n_components=50)
X_pca_50 = pca_50.fit_transform(X_scaled)
# Step 2: Apply t-SNE to the PCA-reduced data
tsne = TSNE(n_components=2, perplexity=30,
learning_rate='auto', random_state=42)
X_tsne_fast = tsne.fit_transform(X_pca_50)
UMAP: Speed Meets Structure
Uniform Manifold Approximation and Projection (UMAP) is a more recent nonlinear technique grounded in algebraic topology and Riemannian geometry. Like t-SNE, it excels at visualization, but it offers several practical advantages: it runs significantly faster on large datasets, it preserves global structure more faithfully than t-SNE, and it exposes a transform method that can project new data points into the learned embedding space.
UMAP works in three stages. First, it constructs a weighted k-nearest-neighbor graph in the high-dimensional space. Second, it builds a similar graph in the low-dimensional target space. Third, it optimizes the low-dimensional layout so the two graphs match as closely as possible, using stochastic gradient descent on a cross-entropy objective function.
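The first stage can be sketched with scikit-learn alone. This is a simplified illustration, not UMAP's actual implementation: it builds the weighted k-nearest-neighbor graph with raw Euclidean distances as weights, whereas UMAP goes on to rescale those distances into fuzzy membership strengths.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Stage 1 (simplified): a sparse, weighted 15-nearest-neighbor graph,
# one row per sample, distances stored as edge weights
knn_graph = kneighbors_graph(X_scaled, n_neighbors=15, mode='distance')
print(f"Graph: {knn_graph.shape[0]} nodes, {knn_graph.nnz} weighted edges")
```

Because each of the 1,797 samples contributes exactly 15 edges, the graph stays sparse, which is a big part of why UMAP scales well.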
# Install: pip install umap-learn
import umap
reducer = umap.UMAP(n_components=2, n_neighbors=15,
min_dist=0.1, metric='euclidean',
random_state=42)
X_umap = reducer.fit_transform(X_scaled)
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1],
c=y, cmap='tab10', s=12, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('UMAP Projection of Digits Dataset')
plt.tight_layout()
plt.savefig('umap_digits.png', dpi=150)
plt.show()
The two critical hyperparameters are n_neighbors and min_dist. The n_neighbors parameter controls the balance between local and global structure -- smaller values emphasize fine-grained local neighborhoods, while larger values capture broader patterns. The min_dist parameter controls how tightly UMAP packs points together in the embedding. Lower values create denser clusters; higher values spread data more evenly.
UMAP follows the scikit-learn API, making it a drop-in replacement for t-SNE in pipelines. Unlike t-SNE, UMAP supports fit and transform as separate steps, so you can fit on training data and transform test data -- a requirement for production ML workflows where you need consistent embeddings across data splits.
Because UMAP supports the transform method, it can be embedded in a scikit-learn pipeline alongside classifiers. Here is a complete example that reduces the digits dataset with UMAP and classifies the embeddings with a random forest:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42,
stratify=y)
reducer = umap.UMAP(n_components=10, n_neighbors=15,
min_dist=0.1, random_state=42)
X_train_umap = reducer.fit_transform(X_train)
X_test_umap = reducer.transform(X_test)
clf = RandomForestClassifier(n_estimators=200,
random_state=42)
clf.fit(X_train_umap, y_train)
y_pred = clf.predict(X_test_umap)
print(f"Accuracy on UMAP-reduced data: "
f"{accuracy_score(y_test, y_pred):.2%}")
Choosing the Right Technique
No single dimensionality reduction method is best for every situation. The right choice depends on whether your data has linear or nonlinear structure, whether you have labels, how large your dataset is, and whether your goal is visualization, preprocessing, or both.
PCA should be the starting point for almost any dimensionality reduction task. It is fast, deterministic, interpretable (each component has a clear explained variance ratio), and works well as a preprocessing step before other algorithms. Use it when features are linearly correlated, when you need to feed reduced data into a downstream model, or when you need a quick variance-based cutoff for the number of features to keep.
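As a preprocessing step, PCA slots directly into a scikit-learn pipeline. A minimal sketch (the classifier choice and the 0.95 variance cutoff are illustrative, not prescriptive):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scale, keep 95% of the variance, then classify the reduced features
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=0.95),
                     LogisticRegression(max_iter=2000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.2%}")
```

Wrapping the scaler and PCA inside the pipeline ensures both are refit on each training fold, avoiding leakage from the validation folds.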
LDA is the right choice when you have labeled data and your goal is classification. It directly optimizes for class separability, which means the reduced features are inherently aligned with the prediction task. Its limitation to C - 1 components makes it unsuitable when you need more dimensions than you have classes, and the Gaussian equal-covariance assumption may not hold for every dataset.
t-SNE remains the gold standard for producing publication-quality visualizations of high-dimensional data, especially when revealing local cluster structure is the priority. Its drawbacks -- slow runtime, no ability to transform new points, and unreliable global distances -- mean it should be used strictly for exploratory visualization, not as a pipeline step before a classifier.
UMAP has emerged as a versatile alternative that covers many of the same use cases as t-SNE while also supporting transformation of new data. It scales better to large datasets, tends to preserve global structure more faithfully, and integrates cleanly with scikit-learn pipelines. For many practitioners, UMAP has become the default nonlinear technique for both visualization and preprocessing.
Key Takeaways
- Scale your data first. Every technique discussed here -- PCA, LDA, t-SNE, UMAP -- is sensitive to the relative scales of input features. Always apply StandardScaler or equivalent normalization before reducing dimensions.
- Start with PCA. It is fast, linear, and deterministic. Even if you plan to use a nonlinear method, running PCA first (down to 50 components, for example) can dramatically speed up t-SNE or UMAP while removing noise.
- Use LDA when you have labels and want classification-optimized features. It directly maximizes class separation, which can outperform PCA in supervised pipelines.
- Use t-SNE for visualization only. Do not feed t-SNE embeddings into classifiers. The technique is non-deterministic, cannot transform new data, and the distances in the embedding are not reliable.
- Use UMAP for both visualization and modeling. Its support for transform, its speed advantage over t-SNE, and its better preservation of global structure make it a strong default for nonlinear dimensionality reduction in production workflows.
Dimensionality reduction is not just a performance optimization -- it is a lens for understanding the structure hidden in your data. Whether you are compressing satellite imagery, clustering gene expression profiles, or preparing features for a fraud detection model, these techniques let you strip away the noise and focus on the signal. The Python ecosystem, anchored by scikit-learn and umap-learn, makes it straightforward to experiment with all four approaches and find the one that fits your problem.