Python Clustering Models: A Practical Guide with Scikit-Learn

Clustering is one of the foundational techniques in unsupervised machine learning. Unlike classification or regression, clustering algorithms work without labeled data -- they discover hidden structure by grouping similar data points together. Python's scikit-learn library (currently at version 1.8) provides a rich collection of clustering algorithms, each suited to different data shapes, densities, and scales. This guide walks through four of the most practical clustering models, complete with working code you can adapt to your own projects.

Whether you are segmenting customers, grouping documents by topic, or detecting anomalies in network traffic, clustering gives you a way to let the data reveal its own natural groupings. The challenge is knowing which algorithm to reach for and how to tune it. This article covers four algorithms that together handle the vast majority of real-world clustering tasks: K-Means, DBSCAN, HDBSCAN, and Agglomerative Clustering.

What Clustering Actually Does

Clustering is a form of unsupervised learning. You hand the algorithm a dataset with no labels, and it assigns each data point to a group (a cluster) based on similarity. Points within the same cluster are more similar to each other than to points in other clusters. The definition of "similar" depends on the algorithm and the distance metric you choose -- Euclidean distance is the default for many algorithms, but Manhattan distance, cosine similarity, and others are available.

Every clustering algorithm in scikit-learn follows the same basic interface. You create a model object, call .fit(X) on your data, and then access the resulting cluster labels through the .labels_ attribute. This consistency makes it straightforward to swap one algorithm for another when experimenting.

# The universal pattern for clustering in scikit-learn
from sklearn.cluster import KMeans

# X is any (n_samples, n_features) numeric array
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
labels = model.labels_
Note

Clustering algorithms generally expect numerical input. If your data contains categorical features, you will need to encode them first. Additionally, many clustering algorithms are sensitive to feature scale, so applying StandardScaler or MinMaxScaler before fitting the model is strongly recommended.

K-Means Clustering

K-Means is the most widely recognized clustering algorithm. It works by placing k centroids in feature space, assigning each data point to the nearest centroid, then repositioning the centroids to the mean of their assigned points. This process repeats until the centroids stabilize or the maximum number of iterations is reached.

The scikit-learn implementation uses the k-means++ initialization strategy by default, which spreads the initial centroids apart to avoid poor convergence. The algorithm supports two computation methods: "lloyd" (the classic EM-style approach) and "elkan", which can be faster on well-separated clusters by leveraging the triangle inequality.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Generate sample data with 4 distinct clusters
X, y_true = make_blobs(
    n_samples=500,
    centers=4,
    cluster_std=0.8,
    random_state=42
)

# Scale features before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means
kmeans = KMeans(
    n_clusters=4,
    init="k-means++",
    n_init="auto",
    max_iter=300,
    random_state=42
)
kmeans.fit(X_scaled)

print(f"Cluster centers shape: {kmeans.cluster_centers_.shape}")
print(f"Inertia: {kmeans.inertia_:.2f}")
print(f"Iterations to converge: {kmeans.n_iter_}")

The inertia_ attribute reports the within-cluster sum of squared distances. Lower inertia means tighter clusters, but it always decreases as you add more clusters. That is where the elbow method comes in -- plot inertia against increasing values of k and look for the point where the rate of decrease sharply levels off.

# Elbow method to find optimal k
inertias = []
k_range = range(2, 11)

for k in k_range:
    model = KMeans(n_clusters=k, n_init="auto", random_state=42)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

# The "elbow" in the plot suggests the best k
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot(k_range, inertias, marker="o", linewidth=2)
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal k")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("elbow_plot.png", dpi=150)
plt.show()
Pro Tip

K-Means assumes clusters are roughly spherical and similarly sized. If your data contains elongated, irregular, or overlapping clusters, K-Means may produce misleading results. Consider running PCA to reduce dimensionality before applying K-Means to high-dimensional datasets -- this can both improve cluster quality and reduce computation time.

When K-Means Works Well

K-Means is an excellent first choice when you have a rough idea of how many clusters exist in your data, the clusters are roughly equal in size and density, and you need an algorithm that scales efficiently to large datasets. It handles hundreds of thousands of samples comfortably, and MiniBatchKMeans pushes that even further by processing data in small random batches.

DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach. Instead of requiring a predefined number of clusters, it discovers clusters based on density. Points packed closely together form clusters, while isolated points in sparse regions get labeled as noise (assigned the label -1).

Two parameters control DBSCAN's behavior: eps defines the maximum distance between two points for them to be considered neighbors, and min_samples sets the minimum number of points required to form a dense region. Getting eps right is the critical tuning challenge -- too large and separate clusters merge; too small and valid clusters fragment into noise.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate crescent-shaped data that K-Means struggles with
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Scale the data
scaler = StandardScaler()
X_moons_scaled = scaler.fit_transform(X_moons)

# Fit DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan.fit(X_moons_scaled)

labels = dbscan.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

This example uses the make_moons dataset, which generates two interleaving crescent shapes. K-Means would split these along a straight boundary and get it wrong, but DBSCAN correctly identifies the two curved clusters because it follows density rather than distance to a centroid.

Note

DBSCAN is not well suited for datasets where clusters have significantly different densities. A single eps value cannot simultaneously capture tight clusters and loose ones. If your data has varying density, HDBSCAN is the better choice.

Tuning eps with a k-Distance Graph

A practical technique for choosing eps is to compute the distance to the k-th nearest neighbor for every point (where k equals your min_samples value), sort those distances, and plot them. The "knee" in the resulting curve suggests a good eps value.

from sklearn.neighbors import NearestNeighbors

# Compute the distance from each point to its 5th nearest neighbor
# (k = min_samples). n_neighbors=6 because, when querying the training
# data, each point's closest "neighbor" is itself at distance zero.
neighbors = NearestNeighbors(n_neighbors=6)
neighbors.fit(X_moons_scaled)
distances, _ = neighbors.kneighbors(X_moons_scaled)

# Column 0 is the point itself, so column 5 is the 5th true neighbor
sorted_distances = np.sort(distances[:, 5])

plt.figure(figsize=(8, 4))
plt.plot(sorted_distances, linewidth=2)
plt.xlabel("Points (sorted by distance)")
plt.ylabel("Distance to 5th Nearest Neighbor")
plt.title("k-Distance Graph for eps Selection")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("kdist_plot.png", dpi=150)
plt.show()

HDBSCAN: The Adaptive Upgrade

HDBSCAN (Hierarchical DBSCAN) solves DBSCAN's biggest limitation: the requirement for a single global density threshold. HDBSCAN essentially runs DBSCAN across all possible eps values and extracts the clusters that remain stable across the widest range of densities. This makes it far more effective on real-world datasets where cluster densities vary.

HDBSCAN was integrated directly into scikit-learn starting with version 1.3, so it is available as sklearn.cluster.HDBSCAN without needing a separate package. Its primary parameter is min_cluster_size, which sets the smallest group of points that can be considered a cluster. Unlike DBSCAN, it does not need an eps parameter.

from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

# Create data with clusters of varying density
centers = [[-0.85, -0.85], [-0.85, 0.85], [3, 3], [3, -3]]
X_varied, _ = make_blobs(
    n_samples=750,
    centers=centers,
    cluster_std=[0.2, 0.35, 1.35, 1.35],
    random_state=0
)

# Fit HDBSCAN -- no need to specify number of clusters or eps
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    min_samples=5
)
hdbscan_model.fit(X_varied)

labels = hdbscan_model.labels_
probabilities = hdbscan_model.probabilities_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {list(labels).count(-1)}")

# Each point gets a membership probability
print(f"Average cluster membership strength: {probabilities[labels != -1].mean():.3f}")

One of HDBSCAN's distinguishing features is the probabilities_ attribute. Each data point receives a score between 0.0 and 1.0 indicating how strongly it belongs to its assigned cluster. Points at the core of a cluster receive scores near 1.0, while points at the edges receive lower scores. Noise points get a probability of 0.0. This soft clustering capability is invaluable when you need confidence scores rather than hard assignments.

Pro Tip

HDBSCAN offers two cluster selection methods via the cluster_selection_method parameter. The default "eom" (Excess of Mass) finds the most persistent clusters and tends to produce fewer, broader groups. Setting it to "leaf" returns the finest-grained clusters at the bottom of the hierarchy, which is useful when you want many small, homogeneous groups.

Agglomerative Clustering

Agglomerative Clustering is a bottom-up hierarchical approach. It starts with every data point as its own cluster, then repeatedly merges the two closest clusters until the desired number of clusters is reached (or a distance threshold is met). The result can be visualized as a dendrogram -- a tree structure showing the sequence of merges.

The linkage parameter controls how distances between clusters are calculated: "ward" minimizes within-cluster variance (producing compact, spherical clusters), "complete" uses the maximum distance between points in two clusters, "average" uses the mean distance, and "single" uses the minimum distance (good for elongated clusters).

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Fit Agglomerative Clustering with Ward linkage
agg = AgglomerativeClustering(
    n_clusters=4,
    linkage="ward"
)
agg.fit(X_scaled)
print(f"Labels: {np.unique(agg.labels_)}")

# Generate a dendrogram to visualize the hierarchy
Z = linkage(X_scaled, method="ward")

plt.figure(figsize=(12, 5))
dendrogram(
    Z,
    truncate_mode="lastp",
    p=30,
    leaf_rotation=90,
    leaf_font_size=8,
    show_contracted=True
)
plt.title("Dendrogram (Ward Linkage)")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
plt.tight_layout()
plt.savefig("dendrogram.png", dpi=150)
plt.show()

The dendrogram is a powerful diagnostic tool. Tall vertical lines indicate well-separated merges (strong cluster boundaries), while short lines indicate clusters that are barely distinct. You can use the dendrogram to visually decide how many clusters are appropriate -- draw a horizontal line and count how many vertical lines it crosses.
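That horizontal cut can also be made in code: setting n_clusters=None and passing distance_threshold stops merging once cluster distances exceed the threshold. The threshold value below is purely illustrative; read an appropriate height off your own dendrogram.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same style of data as the Ward example above
X_b, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
X_b = StandardScaler().fit_transform(X_b)

# Stop merging once the merge distance would exceed the threshold
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0, linkage="ward")
agg_t.fit(X_b)
print(f"Clusters at threshold 10.0: {agg_t.n_clusters_}")
```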

Note

Agglomerative Clustering has a time complexity of O(n^3) and memory complexity of O(n^2) in its general form, which makes it impractical for very large datasets. If you have more than about 10,000 samples, consider using K-Means or HDBSCAN instead, or use the connectivity parameter to impose structure and speed things up.

Evaluating Cluster Quality

Since clustering is unsupervised, there are no ground-truth labels to score against in production. Scikit-learn provides several internal evaluation metrics that measure cluster quality based on the data itself.

Silhouette Score

The silhouette score measures how similar each point is to its own cluster compared to the nearest neighboring cluster. Scores range from -1 to 1, where values near 1 mean points are well-matched to their cluster, values near 0 mean points sit on the boundary between clusters, and negative values mean points may have been assigned to the wrong cluster.

from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

# Evaluate K-Means clustering
sil_score = silhouette_score(X_scaled, kmeans.labels_)
ch_score = calinski_harabasz_score(X_scaled, kmeans.labels_)
db_score = davies_bouldin_score(X_scaled, kmeans.labels_)

print(f"Silhouette Score:       {sil_score:.3f}")
print(f"Calinski-Harabasz Index: {ch_score:.1f}")
print(f"Davies-Bouldin Index:    {db_score:.3f}")

Comparing Multiple Algorithms

A practical workflow is to run several algorithms on the same data and compare their evaluation scores side by side. Here is a function that automates that comparison.

from sklearn.cluster import KMeans, DBSCAN, HDBSCAN, AgglomerativeClustering

def compare_clustering(X, algorithms):
    """Compare clustering algorithms using internal metrics."""
    results = []

    for name, model in algorithms:
        model.fit(X)
        labels = model.labels_

        # Skip evaluation if only one cluster or all noise
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            print(f"{name}: Only {n_clusters} cluster(s) found, skipping.")
            continue

        # Filter out noise points for scoring
        mask = labels != -1
        if mask.sum() < 2:
            continue

        sil = silhouette_score(X[mask], labels[mask])
        ch = calinski_harabasz_score(X[mask], labels[mask])
        db = davies_bouldin_score(X[mask], labels[mask])

        results.append({
            "Algorithm": name,
            "Clusters": n_clusters,
            "Noise": list(labels).count(-1),
            "Silhouette": round(sil, 3),
            "Calinski-Harabasz": round(ch, 1),
            "Davies-Bouldin": round(db, 3)
        })

    return results

# Define algorithms to compare
algorithms = [
    ("K-Means (k=4)", KMeans(n_clusters=4, n_init="auto", random_state=42)),
    ("DBSCAN", DBSCAN(eps=0.5, min_samples=5)),
    ("HDBSCAN", HDBSCAN(min_cluster_size=15)),
    ("Agglomerative", AgglomerativeClustering(n_clusters=4, linkage="ward")),
]

results = compare_clustering(X_scaled, algorithms)
for r in results:
    print(r)
Pro Tip

When comparing density-based algorithms (DBSCAN, HDBSCAN) against centroid-based ones (K-Means), remember that noise-aware algorithms intentionally exclude outlier points. Filter out noise labels before computing silhouette scores, or the comparison will be skewed.

Choosing the Right Algorithm

No single clustering algorithm is the right choice for every dataset. Each has strengths that align with specific data characteristics. Here is a practical decision framework.

Use K-Means when you know or can estimate the number of clusters, your clusters are roughly spherical and similarly sized, and you need speed on large datasets. K-Means scales well and is the fastest option for straightforward clustering tasks.

Use DBSCAN when your data contains noise or outliers that should be explicitly separated, your clusters have non-spherical shapes (crescents, rings, irregular blobs), and cluster density is relatively uniform. The main difficulty is choosing eps correctly.

Use HDBSCAN when clusters vary in density, you do not know the number of clusters in advance, and you want soft cluster membership scores. HDBSCAN is the most flexible density-based option and requires less parameter tuning than DBSCAN.

Use Agglomerative Clustering when you want to explore hierarchical relationships in your data, the dendrogram itself is a useful output, or you need fine control over how clusters merge. It works well on small to mid-sized datasets.

"There is no single best clustering algorithm. The choice depends on the structure of the data, the scale of the dataset, and what questions you are trying to answer." — Scikit-learn documentation

Key Takeaways

  1. Always scale your features first. Clustering algorithms rely on distance calculations. Unscaled features with different ranges will distort the results. Use StandardScaler or MinMaxScaler before fitting any clustering model.
  2. K-Means is fast but rigid. It requires a predefined k, assumes spherical clusters, and cannot handle noise. Use the elbow method or silhouette analysis to choose k, and switch to a density-based method when the data does not fit K-Means' assumptions.
  3. HDBSCAN is the go-to density-based algorithm. It removes the need to tune eps, handles varying densities, provides soft cluster membership, and is built into scikit-learn as of version 1.3. Start here when you do not know your cluster count or density profile.
  4. Evaluate with multiple metrics. No single score tells the full story. Combine silhouette score (overall separation), Calinski-Harabasz index (variance ratio), and Davies-Bouldin index (cluster similarity) for a well-rounded assessment.
  5. Visualize before you commit. Plot your clusters in 2D (using PCA or t-SNE if your data is high-dimensional) to sanity-check the results. A good evaluation score means nothing if the clusters do not make sense in context.

Clustering is as much an art as it is a science. The algorithm gives you the groupings, but understanding whether those groupings are meaningful requires domain knowledge. Start with K-Means for speed, move to HDBSCAN for flexibility, and use Agglomerative Clustering when hierarchy matters. Run multiple algorithms, compare their outputs, and let the data guide your decision.
