Principal Component Analysis (PCA) in Python

Working with datasets that have dozens or hundreds of features can slow down your models and make patterns difficult to spot. Principal Component Analysis (PCA) is one of the go-to techniques for reducing that dimensionality while keeping the information that matters. This article walks through PCA in Python from the ground up, covering the math behind it, how to implement it with scikit-learn, and how to interpret the results.

If you have ever trained a model on a dataset with 50 or more columns, you have probably noticed that training takes longer, overfitting becomes more likely, and visualizing what is going on becomes nearly impossible. PCA addresses all three of those problems by transforming your original features into a smaller set of new features called principal components. Each component captures as much of the remaining variance in the data as possible, and the components are all uncorrelated with one another.

What PCA Actually Does

At its core, PCA is a linear algebra operation. It takes your original feature matrix and finds new axes (directions) along which the data varies the most. These new axes are the principal components. The first principal component points in the direction of greatest variance. The second principal component is perpendicular to the first and captures the next highest amount of variance. This pattern continues for as many components as there are original features.

Here is the sequence of operations PCA performs under the hood:

  1. Center the data by subtracting the mean of each feature so the dataset is centered at the origin.
  2. Compute the covariance matrix to understand how features relate to one another.
  3. Calculate eigenvectors and eigenvalues of the covariance matrix. Each eigenvector defines a principal component direction, and its corresponding eigenvalue tells you how much variance that component explains.
  4. Sort and select the eigenvectors by their eigenvalues in descending order, then keep only the top k components.
  5. Project the data onto the selected components to produce a lower-dimensional representation.
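Those five steps can be sketched directly in NumPy. This is a minimal illustration on random toy data, not a substitute for scikit-learn's optimized implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # toy data: 100 samples, 4 features

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (features x features)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition (eigh works because covariance is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort descending by eigenvalue and keep the top k components
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2
components = eigenvectors[:, :k]

# 5. Project the data onto the selected components
X_reduced = X_centered @ components
print(X_reduced.shape)  # (100, 2)
```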

Note

PCA is a statistical technique that only works with numeric data. If your dataset contains categorical columns, you need to encode them (for example, with one-hot encoding) before applying PCA. Also, PCA finds linear relationships. If the underlying structure in your data is nonlinear, consider Kernel PCA or t-SNE instead.

In practice, PCA is usually computed with Singular Value Decomposition (SVD) rather than by explicitly forming the covariance matrix. Scikit-learn's implementation takes this approach, which is both faster and more numerically stable, especially for high-dimensional data.
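As a quick sanity check (on toy data), the squared singular values of the centered data matrix, divided by n - 1, reproduce the eigenvalues of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
X_centered = X - X.mean(axis=0)

# SVD of the centered data matrix: X = U S Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal component directions, and the
# singular values relate to covariance eigenvalues via S_i^2 / (n - 1)
explained_variance = S**2 / (X.shape[0] - 1)

# Same result as eigendecomposition of the covariance matrix
eigvals = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]
print(np.allclose(explained_variance, eigvals))  # True
```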

Preparing Your Data for PCA

Data preparation is the step that determines whether PCA will give you meaningful results or misleading ones. The single most important requirement is feature scaling. PCA is variance-driven, which means features measured in larger units will dominate the principal components simply because their numeric range is bigger, not because they carry more information.

Consider a dataset with one feature measured in kilograms (values from 0 to 100) and another measured in milligrams (values from 0 to 100,000). Without scaling, PCA will treat the milligram feature as overwhelmingly more important. StandardScaler from scikit-learn solves this by transforming each feature to have a mean of 0 and a standard deviation of 1.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

# Load a real-world dataset with multiple features
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = wine.target

print(f"Original shape: {X.shape}")
print(f"Feature names: {list(X.columns)}")
print(f"\nBefore scaling (first 3 rows):")
print(X.head(3))

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"\nAfter scaling (first 3 rows):")
print(pd.DataFrame(X_scaled, columns=X.columns).head(3))

The Wine dataset has 13 numeric features with very different scales. Alcohol content might range from 11 to 15, while proline values run from 278 to 1680. After applying StandardScaler, every feature sits on the same scale, and PCA can fairly evaluate which directions in the data carry the most variance.

Pro Tip

Always fit the scaler on your training data only, then use transform() on both training and test sets. Fitting on the full dataset before splitting introduces data leakage, which leads to overly optimistic performance estimates.
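A minimal sketch of the correct fit/transform split, using synthetic data for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))  # toy feature matrix

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Correct: fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse training mean/std

# Training data is now exactly standardized; test data is close,
# but not exact, because it was scaled with the training statistics
print(X_train_scaled.mean(axis=0).round(6))
```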

Implementing PCA with scikit-learn

Scikit-learn (version 1.8 as of this writing) provides sklearn.decomposition.PCA, which handles the heavy lifting. The class uses SVD internally and exposes several useful attributes after fitting.

from sklearn.decomposition import PCA

# Fit PCA with all components first to analyze variance
pca_full = PCA()
pca_full.fit(X_scaled)

# Examine explained variance for each component
print("Explained variance ratio per component:")
for i, var in enumerate(pca_full.explained_variance_ratio_):
    print(f"  PC{i+1}: {var:.4f} ({var*100:.2f}%)")

print(f"\nTotal variance explained: "
      f"{sum(pca_full.explained_variance_ratio_):.4f}")

Running this on the Wine dataset reveals how variance is distributed. The first principal component typically explains around 36% of the total variance, the second around 19%, and so on. By the time you reach the fifth or sixth component, you have captured well over 80% of all variance in the original 13 features.

Now apply PCA to reduce to a specific number of components:

# Reduce to 3 principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print(f"Reduced shape: {X_pca.shape}")
print(f"Variance explained by 3 components: "
      f"{sum(pca.explained_variance_ratio_):.4f} "
      f"({sum(pca.explained_variance_ratio_)*100:.2f}%)")

# The components_ attribute shows how each original feature
# contributes to each principal component
components_df = pd.DataFrame(
    pca.components_,
    columns=wine.feature_names,
    index=[f'PC{i+1}' for i in range(3)]
)
print("\nComponent loadings (contribution of each feature):")
print(components_df.round(3))

The components_ attribute is your window into what each principal component actually represents. Each row is a component, and each value shows how much a particular original feature contributes to that component. Large absolute values indicate features that heavily influence a given component. This is how you maintain interpretability even after reducing dimensions.
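One convenient way to summarize the loadings is to rank features by absolute loading for each component. This short helper re-fits PCA on the scaled Wine data so it runs standalone:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

pca = PCA(n_components=3).fit(X_scaled)

# For each component, list the features with the largest
# absolute loadings -- these drive that component
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:3]
    names = [wine.feature_names[j] for j in top]
    print(f"PC{i+1} driven by: {names}")
```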

Choosing the Right Number of Components

Picking the right number of components involves a tradeoff. Fewer components mean simpler models and faster training, but too few and you lose important patterns. There are three common approaches.

Cumulative Explained Variance Threshold

The most widely used method is to plot the cumulative explained variance and choose the number of components that crosses a threshold, typically 90% or 95%.

import matplotlib.pyplot as plt

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Find the number of components for 90% and 95% thresholds
n_90 = np.argmax(cumulative_variance >= 0.90) + 1
n_95 = np.argmax(cumulative_variance >= 0.95) + 1

print(f"Components needed for 90% variance: {n_90}")
print(f"Components needed for 95% variance: {n_95}")

# Plot the cumulative explained variance
plt.figure(figsize=(10, 5))
plt.bar(range(1, len(cumulative_variance) + 1),
        pca_full.explained_variance_ratio_,
        alpha=0.6, label='Individual')
plt.step(range(1, len(cumulative_variance) + 1),
         cumulative_variance, where='mid',
         label='Cumulative', color='#e06c75')
plt.axhline(y=0.90, color='#98c379', linestyle='--',
            label='90% threshold')
plt.axhline(y=0.95, color='#FFD43B', linestyle='--',
            label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance Ratio')
plt.title('PCA Explained Variance')
plt.legend()
plt.tight_layout()
plt.savefig('pca_variance_plot.png', dpi=150)
plt.show()

Fractional n_components

Scikit-learn provides a shortcut. Instead of specifying an integer, you can pass a float between 0 and 1 to n_components. PCA will automatically select the minimum number of components needed to explain that fraction of variance.

# Automatically select components for 95% variance
pca_auto = PCA(n_components=0.95)
X_auto = pca_auto.fit_transform(X_scaled)

print(f"Components selected: {pca_auto.n_components_}")
print(f"Shape after PCA: {X_auto.shape}")

Minka's MLE

For a fully automated approach, scikit-learn supports n_components='mle', which uses a maximum likelihood estimation method published by Thomas Minka to infer the intrinsic dimensionality of the data. This approach works well when you want to let the algorithm decide without setting a threshold.

# Let MLE determine the optimal number of components
pca_mle = PCA(n_components='mle', svd_solver='full')
X_mle = pca_mle.fit_transform(X_scaled)

print(f"MLE selected {pca_mle.n_components_} components")

Note

The 'mle' option only works with the full SVD solver. If you leave svd_solver at its default of 'auto', scikit-learn selects the full solver for you when n_components='mle', but setting it explicitly makes the requirement clear.

Visualizing PCA Results

One of the biggest practical benefits of PCA is the ability to visualize high-dimensional data in 2D or 3D. When you reduce a 13-feature dataset to 2 components, you can plot every data point on a standard scatter plot and see clustering patterns that were invisible before.

import matplotlib.pyplot as plt

# Reduce to 2 components for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

# Create a scatter plot colored by wine class
fig, ax = plt.subplots(figsize=(10, 7))
colors = ['#306998', '#FFD43B', '#98c379']
target_names = wine.target_names

for i, (color, name) in enumerate(zip(colors, target_names)):
    mask = y == i
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1],
               c=color, label=name, alpha=0.7,
               edgecolors='white', linewidth=0.5, s=60)

ax.set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)')
ax.set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)')
ax.set_title('Wine Dataset - PCA Projection')
ax.legend()
plt.tight_layout()
plt.savefig('pca_scatter.png', dpi=150)
plt.show()

The resulting plot typically shows three distinct clusters corresponding to the three wine classes, even though you have compressed 13 features down to just 2. The axis labels include the percentage of variance each component explains, which gives viewers immediate context about how much information the visualization retains.

Examining Feature Contributions with a Biplot

A biplot overlays the original feature vectors on top of the PCA scatter plot. This lets you see which features drive separation along each component.

fig, ax = plt.subplots(figsize=(12, 8))

# Plot the data points
for i, (color, name) in enumerate(zip(colors, target_names)):
    mask = y == i
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1],
               c=color, label=name, alpha=0.5, s=40)

# Overlay feature vectors
loadings = pca_2d.components_.T
feature_names = wine.feature_names
scale_factor = 4  # Scale arrows for visibility

for j, (loading, name) in enumerate(zip(loadings, feature_names)):
    ax.arrow(0, 0, loading[0] * scale_factor,
             loading[1] * scale_factor,
             head_width=0.08, head_length=0.05,
             fc='#e06c75', ec='#e06c75', alpha=0.7)
    ax.text(loading[0] * scale_factor * 1.15,
            loading[1] * scale_factor * 1.15,
            name, fontsize=8, color='black', alpha=0.9)

ax.set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}%)')
ax.set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}%)')
ax.set_title('PCA Biplot - Wine Dataset')
ax.legend()
plt.tight_layout()
plt.savefig('pca_biplot.png', dpi=150)
plt.show()

In the biplot, arrows pointing toward a cluster indicate features with high values for that group. Longer arrows mark features with stronger influence on the principal components, and arrows pointing in similar directions suggest correlated features.
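That visual hint about correlation can be verified numerically. For example, the total_phenols and flavanoids arrows typically point the same way in this biplot, and the raw feature correlation shows why (a small check on the Wine data):

```python
import numpy as np
from sklearn.datasets import load_wine

# Arrows pointing the same way in the biplot usually mean
# the underlying features are correlated -- check it directly
wine = load_wine()
names = list(wine.feature_names)

corr = np.corrcoef(wine.data, rowvar=False)
i = names.index('total_phenols')
j = names.index('flavanoids')
print(f"corr(total_phenols, flavanoids) = {corr[i, j]:.2f}")
```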

PCA in a Machine Learning Pipeline

PCA is rarely used in isolation. It is typically one step in a larger workflow that includes scaling, dimensionality reduction, and model training. Scikit-learn's Pipeline class makes this clean and prevents data leakage by ensuring that fitting happens only on training data.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build a pipeline: Scale -> PCA -> Classify
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Cross-validate on training data
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} "
      f"(+/- {cv_scores.std():.4f})")

# Fit on full training set and evaluate on test set
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_score:.4f}")

# Check how many components PCA selected
n_selected = pipe.named_steps['pca'].n_components_
print(f"PCA selected {n_selected} components")

This pipeline first standardizes the features, then applies PCA to retain 95% of the variance, and finally trains a logistic regression classifier. Because everything is wrapped in a Pipeline, the scaler and PCA are fit only on the training folds during cross-validation. This is the correct way to combine preprocessing and modeling.
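Because PCA is just another pipeline step, n_components can also be tuned like any hyperparameter with GridSearchCV. The grid values below are illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Step name + double underscore addresses a step's parameter
param_grid = {'pca__n_components': [2, 4, 6, 8, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(wine.data, wine.target)

print(f"Best n_components: {search.best_params_['pca__n_components']}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
```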

Comparing Performance With and Without PCA

from sklearn.metrics import classification_report

# Pipeline WITHOUT PCA
pipe_no_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Pipeline WITH PCA
pipe_with_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Evaluate both
for name, pipe in [("Without PCA", pipe_no_pca),
                    ("With PCA", pipe_with_pca)]:
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    accuracy = pipe.score(X_test, y_test)
    print(f"\n{name} - Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred,
                                target_names=wine.target_names))

On small, well-structured datasets like Wine, the accuracy with and without PCA is often very close. The real gains from PCA show up on larger datasets with many features, where reduced dimensionality leads to faster training, less overfitting, and sometimes better generalization.

When to Use PCA and When to Skip It

PCA is a powerful tool, but it is not always the right choice. Here are the situations where it shines and where it falls short.

Use PCA when:

  • Your dataset has many correlated features. PCA excels at collapsing redundant information into fewer components.
  • You need to visualize high-dimensional data. Reducing to 2 or 3 components makes plotting possible.
  • Training time is a concern. Fewer features mean faster model fitting, especially with algorithms that scale poorly with dimensionality.
  • You want to reduce overfitting. By removing low-variance noise dimensions, PCA can help models generalize better.

Skip PCA when:

  • Feature interpretability matters. Principal components are linear combinations of original features, which makes them harder to explain to stakeholders.
  • Your features are already low-dimensional. Applying PCA to a dataset with 5 features rarely helps and adds unnecessary complexity.
  • The relationships in your data are nonlinear. Standard PCA only captures linear patterns. For nonlinear structure, look at Kernel PCA (sklearn.decomposition.KernelPCA) or autoencoders.
  • You are working with tree-based models. Random forests and gradient boosting handle high-dimensional data well on their own and do not benefit much from PCA.
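To illustrate the nonlinear case, here is a minimal comparison of standard PCA and Kernel PCA on scikit-learn's two-moons data. The gamma value is illustrative, not tuned:

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA, PCA

# Two interleaved half-circles: a classic nonlinear dataset
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Linear PCA cannot untangle the moons; an RBF kernel can help
linear = PCA(n_components=2).fit_transform(X)
kernel = KernelPCA(n_components=2, kernel='rbf', gamma=15).fit_transform(X)

print(f"Linear PCA output shape: {linear.shape}")
print(f"Kernel PCA output shape: {kernel.shape}")
```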

Pro Tip

If your dataset is too large to fit in memory, use sklearn.decomposition.IncrementalPCA, which processes data in mini-batches. This is especially useful when working with datasets stored on disk or streamed from a database.
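A minimal sketch of the mini-batch workflow with IncrementalPCA, using synthetic chunks to stand in for data streamed from disk:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)

# Simulate a dataset too large for memory by feeding it
# to IncrementalPCA one mini-batch at a time via partial_fit
ipca = IncrementalPCA(n_components=5)
for _ in range(10):
    batch = rng.normal(size=(100, 20))  # one chunk at a time
    ipca.partial_fit(batch)

# transform() can then be applied batch by batch as well
batch_reduced = ipca.transform(rng.normal(size=(100, 20)))
print(batch_reduced.shape)  # (100, 5)
```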

Key Takeaways

  1. PCA reduces dimensionality by finding new axes that capture maximum variance. It transforms correlated features into uncorrelated principal components, making datasets smaller without losing critical information.
  2. Always standardize your features before applying PCA. Without scaling, features with larger numeric ranges will dominate the results regardless of their actual importance.
  3. Use explained variance ratios to choose the number of components. A cumulative threshold of 90-95% is standard, but scikit-learn also supports automatic selection via fractional values or Minka's MLE method.
  4. Wrap PCA in a Pipeline. Combining scaling, PCA, and your model in a scikit-learn Pipeline prevents data leakage and keeps your code clean and reproducible.
  5. PCA is not universal. It works best with correlated numeric features and linear relationships. For nonlinear data or when interpretability is critical, consider alternatives like Kernel PCA, t-SNE, or UMAP.

PCA remains one of the foundational techniques in data science and machine learning. Whether you are compressing image data, preprocessing features for a classifier, or simply trying to visualize a complex dataset, understanding PCA gives you a reliable tool for cutting through high-dimensional noise. The scikit-learn implementation makes it straightforward to apply, and the Pipeline integration ensures it fits cleanly into production-ready workflows.
