Unsupervised Learning Models in Python: A Practical Guide

Unsupervised learning is the branch of machine learning that finds hidden structure in data without labeled outcomes. Unlike supervised learning where the model trains on input-output pairs, unsupervised models work with raw, unlabeled datasets to discover patterns, group similar observations, reduce complexity, and flag anomalies. This guide walks through six essential unsupervised learning models in Python using scikit-learn, complete with working code you can run immediately.

Every model in this article uses scikit-learn (version 1.8 at the time of writing), the standard Python library for machine learning. Before starting, make sure you have the required packages installed:

pip install scikit-learn numpy matplotlib

All of the examples below use synthetic datasets generated with scikit-learn's built-in utilities. This keeps the focus on understanding the algorithms themselves rather than on data loading or preprocessing.

What Makes Learning "Unsupervised"?

In supervised learning, every training example comes with a label. A spam classifier, for instance, trains on emails that have already been tagged as "spam" or "not spam." The model learns the mapping between features and labels, then applies that mapping to new data.

Unsupervised learning removes the labels entirely. The algorithm receives only the input features and must find meaningful structure on its own. There is no "correct answer" to check against during training. Instead, the model identifies natural groupings, compresses high-dimensional data into fewer dimensions, or detects data points that deviate from normal patterns.

The three primary categories of unsupervised learning are:

  • Clustering — Grouping data points that share similar characteristics. Customer segmentation and document categorization are common applications.
  • Dimensionality reduction — Compressing data into fewer features while preserving the information that matters. This is used for visualization, noise removal, and speeding up downstream models.
  • Anomaly detection — Identifying data points that do not conform to the expected pattern. Fraud detection and network intrusion detection rely on this technique.

Clustering: K-Means

K-Means is one of the simplest and most widely used clustering algorithms. It partitions data into k clusters by iteratively assigning each point to the nearest cluster center (centroid), then recalculating each centroid from the points assigned to it. This process repeats until the centroids stabilize.

Note

K-Means requires you to specify the number of clusters k in advance. If you do not know how many clusters exist in your data, use the elbow method or silhouette analysis (both demonstrated below) to estimate it.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate synthetic data with 4 distinct clusters
X, y_true = make_blobs(
    n_samples=400,
    centers=4,
    cluster_std=0.8,
    random_state=42
)

# Elbow method: test a range of k values
inertias = []
k_range = range(2, 10)

for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

# Plot the elbow curve
plt.figure(figsize=(8, 4))
plt.plot(k_range, inertias, marker="o", linewidth=2)
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.title("Elbow Method for Optimal k")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("kmeans_elbow.png", dpi=150)
plt.show()

The elbow curve shows how inertia (the sum of squared distances between each point and its assigned centroid) decreases as k increases. The optimal k sits at the "elbow" where the rate of decrease slows sharply. In this example, the elbow appears at k=4, which matches the four clusters in the synthetic data.

# Fit K-Means with optimal k
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Evaluate clustering quality
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.3f}")

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=30, alpha=0.7)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    c="red", marker="X", s=200, edgecolors="black",
    label="Centroids"
)
plt.title("K-Means Clustering (k=4)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("kmeans_clusters.png", dpi=150)
plt.show()

The silhouette score ranges from -1 to 1. Values close to 1 indicate that data points are well-matched to their own cluster and poorly matched to neighboring clusters. A score above 0.5 generally suggests reasonable clustering structure.
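Silhouette analysis can also drive the choice of k directly: compute the score for each candidate value and take the maximum. A minimal sketch on the same kind of synthetic blobs used above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

# Score each candidate k; silhouette requires at least 2 clusters
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

On well-separated data the silhouette maximum and the elbow usually agree; when they disagree, the silhouette curve is often the easier of the two to read.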

Pro Tip

K-Means assumes clusters are spherical and roughly equal in size. If your data has irregularly shaped or overlapping clusters, consider DBSCAN, HDBSCAN, or Gaussian Mixture Models instead.

Density-Based Clustering: DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach from K-Means. Rather than assuming a fixed number of spherical clusters, DBSCAN identifies regions of high point density separated by regions of low density. This allows it to find clusters of arbitrary shape and automatically detect outliers as "noise" points that do not belong to any cluster.

The algorithm relies on two parameters: eps (the maximum distance between two points for them to be considered neighbors) and min_samples (the minimum number of points required to form a dense region). Points that have at least min_samples neighbors within eps distance are classified as core points. Points reachable from core points but lacking enough neighbors themselves are border points. Everything else is noise.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Generate crescent-shaped data that K-Means would struggle with
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Count clusters and noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

# Visualize
plt.figure(figsize=(8, 6))
unique_labels = set(labels)
colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels)))

for label, color in zip(unique_labels, colors):
    if label == -1:
        color = "gray"
        marker = "x"
        alpha = 0.5
        label_name = "Noise"
    else:
        marker = "o"
        alpha = 0.7
        label_name = f"Cluster {label}"

    mask = labels == label
    plt.scatter(X[mask, 0], X[mask, 1], c=[color], marker=marker,
                alpha=alpha, s=30, label=label_name)

plt.title("DBSCAN Clustering on Crescent-Shaped Data")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("dbscan_clusters.png", dpi=150)
plt.show()

DBSCAN correctly separates the two crescent shapes without requiring you to specify the number of clusters beforehand. It also labels any isolated points as noise, which is a significant advantage in real-world datasets that contain outliers.

Important

DBSCAN's performance is highly sensitive to the eps parameter, and the right value depends entirely on the scale and density of your data. Always standardize your features before running DBSCAN. A k-distance plot (plotting the distance to the k-th nearest neighbor for each point, sorted in ascending order) can help you choose an appropriate eps value.
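The k-distance curve mentioned above can be computed with scikit-learn's NearestNeighbors. A sketch on the same standardized moons data; the high percentile used here is only a crude stand-in for reading the knee of the plotted curve by eye:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# Distance to the k-th nearest neighbor for each point (k = min_samples)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])  # distance to the k-th true neighbor, ascending

# The "knee" of this sorted curve is a reasonable eps;
# a high percentile is a rough automated approximation
eps_guess = np.quantile(k_distances, 0.95)
print(f"Suggested eps: {eps_guess:.2f}")
```

Plotting k_distances against point index gives the k-distance plot itself; the eps value at the sharp upward bend is usually the best starting point.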

Hierarchical Density Clustering: HDBSCAN

HDBSCAN (Hierarchical DBSCAN) extends DBSCAN by eliminating the need to choose a fixed eps value. It effectively runs DBSCAN across all possible epsilon values and extracts the clusters that persist the longest, making it far more robust to parameter selection. HDBSCAN was added to scikit-learn in version 1.3 and remains available in the current 1.8 release.

The key advantage of HDBSCAN is its ability to handle clusters of varying densities. Traditional DBSCAN assumes that all clusters have roughly the same density, which breaks down when your data contains both tight groups and loose groups. HDBSCAN handles these mixed-density scenarios naturally.

from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

# Create data with clusters of varying density
centers = [[0, 0], [5, 5], [10, 0]]
X, y_true = make_blobs(
    n_samples=[200, 100, 50],
    centers=centers,
    cluster_std=[0.5, 1.5, 0.3],
    random_state=42
)

# Fit HDBSCAN - no eps needed
hdbscan = HDBSCAN(min_cluster_size=15, min_samples=5)
labels = hdbscan.fit_predict(X)

# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

# Visualize cluster assignments
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    X[:, 0], X[:, 1],
    c=labels,
    cmap="viridis",
    s=30,
    alpha=0.7
)

# Mark noise points
noise_mask = labels == -1
plt.scatter(
    X[noise_mask, 0], X[noise_mask, 1],
    c="gray", marker="x", s=30, alpha=0.5, label="Noise"
)

plt.title("HDBSCAN Clustering (Varying Density Clusters)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("hdbscan_clusters.png", dpi=150)
plt.show()

The min_cluster_size parameter is intuitive: it sets the smallest grouping that HDBSCAN will treat as a real cluster rather than noise. Smaller values produce more (and smaller) clusters; larger values cause small groups to be discarded as noise. For exploratory analysis, starting with a value between 10 and 50 typically works well.

Pro Tip

HDBSCAN also produces a probabilities_ attribute that gives each point a membership confidence score between 0 and 1. Points near cluster centers have high probabilities, while those on the edges have lower values. This soft assignment can be useful when you need to gauge how confident the algorithm is about each point's cluster membership.

Dimensionality Reduction: PCA

Principal Component Analysis (PCA) is the workhorse of dimensionality reduction. It transforms high-dimensional data into a new coordinate system where the axes (principal components) are ordered by how much variance they capture. The first component captures the direction of greatest variance, the second captures the next greatest orthogonal direction, and so on.

PCA is commonly used in two scenarios: reducing the number of features before feeding data into another model (which can speed up training and reduce overfitting), and compressing high-dimensional data down to 2 or 3 dimensions for visualization.

from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Load the digits dataset (8x8 images = 64 features)
digits = load_digits()
X = digits.data
y = digits.target

# Standardize features
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and examine explained variance
pca_full = PCA()
pca_full.fit(X_scaled)

# Plot cumulative explained variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

plt.figure(figsize=(8, 4))
plt.plot(range(1, len(cumulative_variance) + 1),
         cumulative_variance, marker="o", markersize=3, linewidth=2)
plt.axhline(y=0.95, color="red", linestyle="--", label="95% variance")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA: Explained Variance vs. Number of Components")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("pca_variance.png", dpi=150)
plt.show()

# Find how many components capture 95% of variance
n_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Components needed for 95% variance: {n_95} (out of {X.shape[1]})")

For the digits dataset, PCA compresses the 64 original features into a much smaller set of components while still retaining 95% of the total variance (the exact count is printed by the code above). That is a significant reduction that speeds up any downstream model.
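One convenient shortcut: passing a float between 0 and 1 as n_components tells PCA to keep however many components are needed to reach that fraction of variance. A sketch on the raw (unscaled) digits data, round-tripping through inverse_transform to measure reconstruction error:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # (1797, 64)

# Keep just enough components for 95% variance, then reconstruct
pca = PCA(n_components=0.95)  # float in (0, 1) = target variance fraction
X_reduced = pca.fit_transform(X)
X_restored = pca.inverse_transform(X_reduced)

print(f"Components kept: {pca.n_components_}")
mse = np.mean((X - X_restored) ** 2)
print(f"Reconstruction MSE: {mse:.3f}")
```

Note that this sketch skips the StandardScaler step from the main example, so the component count it reports will differ from the scaled pipeline's.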

# Reduce to 2 components for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

print(f"Variance explained by 2 components: "
      f"{sum(pca_2d.explained_variance_ratio_):.1%}")

# Scatter plot colored by digit label
plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    X_2d[:, 0], X_2d[:, 1],
    c=y, cmap="tab10", s=15, alpha=0.6
)
plt.colorbar(scatter, label="Digit")
plt.xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)")
plt.ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)")
plt.title("Digits Dataset Projected onto First Two Principal Components")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("pca_2d.png", dpi=150)
plt.show()

Even with just two components, distinct groupings emerge in the scatter plot. Digits like 0 and 1 typically separate clearly, while visually similar digits (like 3, 5, and 8) overlap more. This kind of visualization is invaluable for understanding whether meaningful structure exists in your data before committing to a full modeling pipeline.

Soft Clustering: Gaussian Mixture Models

A Gaussian Mixture Model (GMM) assumes that the data is generated from a mixture of several Gaussian (normal) distributions, each with its own mean and covariance. Unlike K-Means, which makes hard assignments (each point belongs to exactly one cluster), GMMs produce soft assignments by calculating the probability that each point belongs to each cluster.

This probabilistic approach is useful when cluster boundaries are not clearly defined, or when you need to quantify uncertainty in cluster assignments. GMMs can also model elliptical clusters of different sizes and orientations, whereas K-Means is limited to spherical clusters of similar size.

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Generate overlapping clusters
X, y_true = make_blobs(
    n_samples=600,
    centers=3,
    cluster_std=[1.0, 1.5, 0.5],
    random_state=42
)

# Fit Gaussian Mixture Model
gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",
    random_state=42
)
gmm.fit(X)

# Predict hard labels and soft probabilities
labels = gmm.predict(X)
probs = gmm.predict_proba(X)

# Print the probability distribution for the first 5 points
print("Cluster membership probabilities (first 5 points):")
print(f"{'Point':<8} {'Cluster 0':<12} {'Cluster 1':<12} {'Cluster 2':<12}")
for i in range(5):
    print(f"{i:<8} {probs[i, 0]:<12.4f} {probs[i, 1]:<12.4f} {probs[i, 2]:<12.4f}")
# Use BIC to determine optimal number of components
bic_scores = []
n_range = range(1, 8)

for n in n_range:
    gm = GaussianMixture(n_components=n, covariance_type="full",
                         random_state=42)
    gm.fit(X)
    bic_scores.append(gm.bic(X))

plt.figure(figsize=(8, 4))
plt.plot(n_range, bic_scores, marker="o", linewidth=2)
plt.xlabel("Number of Components")
plt.ylabel("BIC Score (lower is better)")
plt.title("Bayesian Information Criterion for GMM Component Selection")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("gmm_bic.png", dpi=150)
plt.show()

optimal_n = list(n_range)[np.argmin(bic_scores)]
print(f"Optimal number of components (by BIC): {optimal_n}")

The Bayesian Information Criterion (BIC) balances model fit against complexity. The number of components that produces the lowest BIC is typically the best choice. The Akaike Information Criterion (AIC), available via gmm.aic(X), offers an alternative that tends to favor slightly more complex models.
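Computing the two criteria side by side shows whether they agree. A sketch on the same synthetic blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3,
                  cluster_std=[1.0, 1.5, 0.5], random_state=42)

# Evaluate AIC and BIC over the same range of component counts
n_range = range(1, 8)
aic_scores, bic_scores = [], []
for n in n_range:
    gm = GaussianMixture(n_components=n, covariance_type="full",
                         random_state=42).fit(X)
    aic_scores.append(gm.aic(X))
    bic_scores.append(gm.bic(X))

best_aic = list(n_range)[np.argmin(aic_scores)]
best_bic = list(n_range)[np.argmin(bic_scores)]
print(f"Best n by AIC: {best_aic}, best n by BIC: {best_bic}")
```

When the two disagree, BIC's heavier complexity penalty means it picks the smaller model; which is preferable depends on whether you care more about parsimony or fit.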

# Visualize with decision boundaries
x_min, x_max = X[:, 0].min() - 2, X[:, 0].max() + 2
y_min, y_max = X[:, 1].min() - 2, X[:, 1].max() + 2
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 200),
    np.linspace(y_min, y_max, 200)
)
grid_points = np.column_stack([xx.ravel(), yy.ravel()])

# Negative log-likelihood over the grid (higher values = lower density)
Z = -gmm.score_samples(grid_points)
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, levels=20, cmap="Blues_r", alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=20, alpha=0.7)
plt.title("Gaussian Mixture Model with Density Contours")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("gmm_contours.png", dpi=150)
plt.show()

Note

The covariance_type parameter controls the shape of the clusters that the model can represent. "full" allows each cluster to have its own unconstrained covariance matrix (elliptical clusters of any orientation). "tied" forces all clusters to share the same covariance. "diag" restricts clusters to axis-aligned ellipses. "spherical" makes each cluster a sphere, similar to K-Means.
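One way to choose among these structures is to compare BIC across all four options at a fixed component count. A sketch on the same synthetic blobs:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3,
                  cluster_std=[1.0, 1.5, 0.5], random_state=42)

# BIC across the four covariance structures, same component count
bics = {}
for cov in ("full", "tied", "diag", "spherical"):
    gm = GaussianMixture(n_components=3, covariance_type=cov,
                         random_state=42).fit(X)
    bics[cov] = gm.bic(X)

# Lower BIC = better trade-off between fit and parameter count
for cov, score in sorted(bics.items(), key=lambda kv: kv[1]):
    print(f"{cov:>10}: BIC = {score:.1f}")
```

Because make_blobs produces isotropic clusters, the simpler covariance types tend to score competitively here; on elliptical real-world data, "full" often wins despite its extra parameters.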

Anomaly Detection: Isolation Forest

Isolation Forest takes a clever approach to anomaly detection. Instead of trying to model "normal" behavior and flagging deviations, it exploits the fact that anomalies are rare and different. The algorithm builds an ensemble of random decision trees, and since anomalies have unusual feature values, they tend to be isolated (separated from the rest of the data) in fewer splits. The fewer splits it takes to isolate a point, the more anomalous the model considers it.

from sklearn.ensemble import IsolationForest

# Generate normal data with some outliers
rng = np.random.RandomState(42)
X_normal = 0.3 * rng.randn(300, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])

# Fit Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.06,  # expected proportion of outliers
    random_state=42
)
predictions = iso_forest.fit_predict(X)

# predictions: 1 = inlier, -1 = outlier
n_outliers = (predictions == -1).sum()
print(f"Detected outliers: {n_outliers} out of {len(X)} points")

# Get anomaly scores (lower = more anomalous)
scores = iso_forest.decision_function(X)

# Visualize
plt.figure(figsize=(8, 6))
inliers = predictions == 1
outliers = predictions == -1

plt.scatter(X[inliers, 0], X[inliers, 1], c="steelblue",
            s=20, alpha=0.6, label="Inlier")
plt.scatter(X[outliers, 0], X[outliers, 1], c="red",
            s=50, marker="x", label="Outlier")
plt.title("Isolation Forest Anomaly Detection")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("isolation_forest.png", dpi=150)
plt.show()

The contamination parameter tells the model what fraction of the data you expect to be anomalous. Setting it too low will miss real anomalies; setting it too high will flag normal points. When the true contamination rate is unknown, you can use the anomaly scores from decision_function() to rank points by anomalousness and apply your own threshold.
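A sketch of that threshold-based workflow, flagging the lowest-scoring 5% of points (the percentile is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([0.3 * rng.randn(300, 2),
               rng.uniform(low=-4, high=4, size=(20, 2))])

# Rank by anomaly score instead of trusting a contamination guess
iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
scores = iso.decision_function(X)  # lower = more anomalous

# Flag the bottom 5% of scores as anomalies
threshold = np.percentile(scores, 5)
flagged = scores < threshold
print(f"Flagged {flagged.sum()} of {len(X)} points")
```

In production, the threshold is usually set from business constraints (how many alerts can be reviewed per day) rather than from a fixed percentile.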

# Score distribution
plt.figure(figsize=(8, 4))
plt.hist(scores, bins=50, edgecolor="black", alpha=0.7)
plt.axvline(x=0, color="red", linestyle="--", label="Default threshold")
plt.xlabel("Anomaly Score (lower = more anomalous)")
plt.ylabel("Frequency")
plt.title("Distribution of Isolation Forest Anomaly Scores")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("isolation_scores.png", dpi=150)
plt.show()

Pro Tip

Isolation Forest works well with high-dimensional data and is computationally efficient because it uses random subsampling. For datasets with hundreds of features, it often outperforms distance-based methods like Local Outlier Factor, which suffer from the curse of dimensionality.
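On low-dimensional data like the example above, the two detectors can be run side by side with the same contamination budget as a quick sanity check:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.vstack([0.3 * rng.randn(300, 2),
               rng.uniform(low=-4, high=4, size=(20, 2))])

# Same expected outlier fraction for both detectors
iso = IsolationForest(contamination=0.06, random_state=42)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)

iso_pred = iso.fit_predict(X)  # 1 = inlier, -1 = outlier
lof_pred = lof.fit_predict(X)

agree = (iso_pred == lof_pred).mean()
print(f"Agreement between detectors: {agree:.1%}")
```

High agreement on easy, low-dimensional data is expected; the methods diverge as dimensionality grows, which is where Isolation Forest tends to pull ahead.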

Choosing the Right Model

Selecting the appropriate unsupervised learning model depends on what you are trying to accomplish and what assumptions hold for your data. Here is a practical decision framework:

Use K-Means when you have a rough idea of how many clusters exist, the clusters are roughly spherical and similarly sized, and you need a fast, interpretable result. It scales well to large datasets.

Use DBSCAN when clusters have irregular shapes, you do not know how many clusters exist, and your data may contain noise. Be prepared to tune the eps parameter, ideally after standardizing features.

Use HDBSCAN when your data has clusters of varying density. It requires less parameter tuning than DBSCAN and is an excellent default for exploratory analysis. The min_cluster_size parameter is the primary knob to adjust.

Use PCA when you need to reduce the number of features for visualization, denoising, or to speed up another model. It is a preprocessing step rather than a final model in many workflows.

Use Gaussian Mixture Models when you need probabilistic cluster assignments, when clusters overlap significantly, or when you believe the data is generated by a mixture of Gaussian distributions. Use BIC to select the number of components.

Use Isolation Forest when you need to detect outliers or anomalies. It is especially effective in high-dimensional settings and does not require labeled examples of anomalies.

Key Takeaways

  1. No labels required: Unsupervised learning extracts structure from raw data. The trade-off is that evaluation is harder since there is no ground truth to measure against. Use metrics like silhouette score, BIC, or explained variance ratio to assess your results.
  2. Preprocessing matters: Feature scaling (standardization or normalization) is critical for distance-based algorithms like K-Means, DBSCAN, and HDBSCAN. Skipping this step is one of the common reasons for poor clustering results.
  3. Start simple, then refine: Begin with K-Means or PCA to get a baseline understanding of your data's structure. Move to more sophisticated models like HDBSCAN or GMMs when the simpler methods fall short.
  4. Combine techniques: Unsupervised methods work well together. Run PCA to reduce dimensionality, then cluster the reduced data with HDBSCAN. Use Isolation Forest to remove anomalies before clustering. These pipelines often produce better results than any single algorithm.
  5. scikit-learn has you covered: All six models in this article share the same estimator API: fit, fit_predict, and (for models that support out-of-sample prediction) predict. Once you learn the pattern, switching between algorithms takes only a few lines of code.

Unsupervised learning is one of the areas where Python's machine learning ecosystem truly shines. The algorithms are accessible, well-documented, and battle-tested across industries from finance to healthcare. The best way to build intuition is to experiment: generate synthetic datasets, adjust parameters, and observe how the models respond. Each algorithm has its strengths and blind spots, and the only reliable way to discover which one fits your problem is to run the code.
