Hierarchical Clustering in Python

Hierarchical clustering builds a tree of clusters from your data, letting you explore groupings at every level of granularity without ever committing to a fixed number of clusters upfront. This guide walks through the algorithm from first principles, then implements it in Python with SciPy and scikit-learn, and goes further into territory that tutorials typically skip: rigorous cluster validation, scaling strategies for real-world datasets, the preprocessing decisions that silently make or break your results, and the mental models that separate a clustering practitioner from someone who just calls linkage().

Clustering is one of the foundational tasks in unsupervised machine learning. While algorithms like K-Means require you to specify the number of clusters before running, hierarchical clustering takes a different approach. It produces a nested series of groupings, organized into a tree structure called a dendrogram. You can then decide how many clusters to extract after examining the results, giving you flexibility that flat clustering methods cannot match.

Python offers two primary libraries for hierarchical clustering: scipy.cluster.hierarchy for low-level control over linkage computation and dendrogram visualization, and sklearn.cluster.AgglomerativeClustering for a higher-level interface that integrates naturally with the scikit-learn pipeline. This article covers both approaches with complete, runnable code. Current stable versions at time of writing are SciPy 1.17.1 (released February 2026) and scikit-learn 1.8.0 (released December 2025).

What Is Hierarchical Clustering

Hierarchical clustering organizes data into a tree of clusters. There are two main strategies for building that tree:

Agglomerative (bottom-up) starts by treating every data point as its own cluster. At each step, the two closest clusters are merged into one. This continues until all points belong to a single cluster. Agglomerative clustering is by far the more common approach, and it is what SciPy and scikit-learn implement.

Divisive (top-down) begins with all data points in a single cluster, then recursively splits clusters apart. While conceptually straightforward, divisive clustering is more computationally expensive and less widely supported in standard Python libraries. The DIANA (DIvisive ANAlysis) algorithm is the classical implementation, available in R's cluster package but not natively in Python's major ML libraries.

The key output of agglomerative clustering is the linkage matrix, a structured array that records which clusters were merged at each step, the distance between them, and how many original data points the new cluster contains. This matrix encodes the entire tree structure and serves as the input for dendrogram visualization and flat cluster extraction.

Here is something worth internalizing before writing a single line of code: the dendrogram is not just a visualization. It is the complete clustering result. Every possible flat partition of your data, at every possible number of clusters, is encoded in that single tree. The act of "choosing k clusters" is equivalent to making a horizontal cut across the dendrogram at a specific height. This is a fundamentally different relationship between the algorithm and the analyst than what you get with K-Means, where each value of k produces an entirely separate model that must be trained independently.

Note

Unlike K-Means, hierarchical clustering does not require specifying the number of clusters before running the algorithm. You choose the number of clusters after inspecting the dendrogram, which makes it especially useful for exploratory data analysis where the structure of the data is not yet known.

The Preprocessing Step Nobody Skips Twice

Hierarchical clustering computes distances between data points, which means the scale of your features directly determines the clustering outcome. If one feature is measured in thousands (like annual salary) and another in single digits (like years of experience), the salary feature will dominate every distance calculation, and the years-of-experience feature will have almost no influence on which clusters form.

This is not a theoretical concern. It is the single most common reason that new practitioners get nonsensical clustering results and then blame the algorithm. The solution is to standardize your features before clustering:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Always standardize before distance-based clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Now features contribute equally to distance calculations
# Each feature has mean=0 and std=1

StandardScaler transforms each feature to have zero mean and unit variance. This ensures that no single feature overwhelms the distance computation simply because of its measurement scale. For features with heavy-tailed distributions or significant outliers, RobustScaler (which uses the median and interquartile range instead of the mean and standard deviation) can be a better choice.
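The difference between the two scalers matters most when an outlier is present. A minimal sketch contrasting them on a toy feature with one extreme value (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Toy feature with one extreme outlier (values are illustrative)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)
x_rob = RobustScaler().fit_transform(x)

# The outlier inflates the standard deviation, so StandardScaler
# squeezes the four inliers into a narrow band below the mean
print("StandardScaler:", x_std.ravel().round(2))

# RobustScaler centers on the median (3.0) and scales by the IQR (2.0),
# so the inliers keep a usable spread
print("RobustScaler:  ", x_rob.ravel().round(2))
```

With StandardScaler the four inliers end up nearly indistinguishable; with RobustScaler they retain meaningful separation, which is exactly what distance-based clustering needs.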

There is a deeper principle at work here. The SciPy documentation for the linkage() function states that methods like Ward, centroid, and median are only valid with Euclidean distance (source: SciPy linkage docs). Euclidean distance treats all dimensions equally by definition, so feeding it unscaled features violates the geometric assumptions that make Ward linkage mathematically sound. You are not just getting "worse" results; you are getting mathematically invalid results from a function that cannot warn you about the problem.

Pro Tip

If your features are already on comparable scales (for example, all are percentages ranging from 0 to 100), standardization may not be necessary and could even be counterproductive. Use domain knowledge to decide. But when in doubt, scale first.

Linkage Methods Explained

The linkage method determines how the distance between two clusters is calculated when deciding which pair to merge next. The choice of linkage has a significant effect on the shape and quality of the resulting clusters. SciPy's linkage() function supports several methods:

Ward minimizes the total within-cluster variance. At each step, it merges the pair of clusters that results in the smallest increase in total variance. Ward tends to produce compact, roughly equal-sized clusters and is a strong default choice. It requires Euclidean distance. The SciPy implementation uses the nearest-neighbor chain algorithm, giving it O(n²) time complexity (source: SciPy linkage docs).

Single (nearest point) defines the distance between two clusters as the minimum distance between any pair of points, one from each cluster. Single linkage can detect elongated or irregularly shaped clusters, but it is susceptible to the chaining effect, where a trail of closely spaced points can cause very different clusters to merge prematurely. SciPy implements single linkage with an optimized minimum spanning tree algorithm, also achieving O(n²) time complexity.

Complete (farthest point) uses the maximum distance between any pair of points from the two clusters. This tends to produce tight, compact clusters but can be sensitive to outliers, since a single distant point can inflate the inter-cluster distance.

Average (UPGMA) computes the mean distance between all pairs of points across two clusters. Average linkage provides a middle ground between single and complete, balancing sensitivity to outliers with the ability to handle moderately non-spherical cluster shapes. In practice it often yields the highest cophenetic correlation, meaning the dendrogram structure faithfully represents the original pairwise distances.

Weighted (WPGMA) is similar to average linkage but does not weight by cluster size. When two clusters of very different sizes merge, weighted linkage treats them equally rather than letting the larger cluster dominate the calculation.

Centroid measures the Euclidean distance between the centroids of two clusters. Like Ward, it requires Euclidean distance to produce valid results. One important caveat: centroid and median linkage can produce inversions in the dendrogram, where a merge occurs at a lower distance than a previous merge. This means the dendrogram's vertical axis is no longer monotonically increasing, which can make interpretation confusing and invalidates the standard "cut at a height" approach to extracting flat clusters.

Median (WPGMC) is a variant of centroid linkage that gives equal weight to the two clusters being merged, regardless of size. It shares the same inversion problem as centroid linkage.

An important detail about computational complexity: single, complete, average, weighted, and Ward linkage all run in O(n²) time in SciPy. Centroid and median linkage use a naive O(n³) algorithm (source: SciPy linkage docs). All methods use O(n²) memory regardless of which linkage you choose.
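The inversion caveat can be checked programmatically: SciPy's is_monotonic() reports whether a linkage matrix's merge distances never decrease. A quick sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, is_monotonic

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))

# Ward, single, complete, and average always produce a monotonic
# dendrogram: each merge occurs at a distance >= the previous one
Z_ward = linkage(X, method='ward')
print("ward monotonic:    ", is_monotonic(Z_ward))

# Centroid linkage can produce inversions, where a merge occurs at a
# lower distance than an earlier one, breaking monotonicity
Z_cent = linkage(X, method='centroid')
print("centroid monotonic:", is_monotonic(Z_cent))
```

Running this check before cutting the dendrogram by height is a cheap way to catch the centroid/median inversion problem on your own data.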

Pro Tip

Start with Ward linkage as your default. It produces well-balanced clusters and works reliably on many datasets. Switch to average linkage if you need to use a non-Euclidean distance metric, or to single linkage if you expect elongated or chain-like cluster shapes. Avoid centroid and median linkage unless you have a specific reason to use them, because their inversion behavior and O(n³) complexity create unnecessary complications.

Building a Dendrogram with SciPy

The scipy.cluster.hierarchy module provides the core functions for hierarchical clustering. The two essential functions are linkage() for computing the cluster hierarchy and dendrogram() for visualizing it.

Here is a complete example that generates sample data, computes the linkage matrix, and plots a dendrogram:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Generate sample data: three clusters of 2D points
np.random.seed(42)
cluster_1 = np.random.randn(15, 2) + [0, 0]
cluster_2 = np.random.randn(15, 2) + [5, 5]
cluster_3 = np.random.randn(15, 2) + [10, 0]
X = np.vstack([cluster_1, cluster_2, cluster_3])

# Standardize features (critical for real-world data)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compute the linkage matrix using Ward's method
Z = linkage(X_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(12, 6))
dendrogram(Z, leaf_rotation=90, leaf_font_size=9)
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.tight_layout()
plt.savefig('dendrogram_ward.png', dpi=150)
plt.show()

The linkage() function accepts a 2D array of observations and returns a linkage matrix Z with shape (n-1, 4). Each row records a single merge operation: the indices of the two clusters being merged (columns 0 and 1), the distance between them (column 2), and the number of original observations in the new cluster (column 3).

The dendrogram is a tree diagram where the vertical axis represents the distance at which each merge occurred. Tall vertical lines indicate large jumps in distance, which typically correspond to natural boundaries between well-separated clusters. The optimal_ordering=True parameter in linkage() can reorder the leaves to minimize the distance between adjacent leaves, making the dendrogram easier to read at the cost of additional computation time.
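To inspect the leaf ordering directly, SciPy's leaves_list() returns the left-to-right order in which the dendrogram would draw the observations. A short sketch comparing the default and optimally ordered trees on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(42)
X = rng.standard_normal((20, 2))

Z_plain = linkage(X, method='ward')
Z_opt = linkage(X, method='ward', optimal_ordering=True)

# leaves_list() returns the left-to-right leaf order the dendrogram uses
print("default order:", leaves_list(Z_plain))
print("optimal order:", leaves_list(Z_opt))

# Optimal ordering changes only the presentation: the set of merge
# heights is identical
print("same heights:", np.allclose(np.sort(Z_plain[:, 2]),
                                   np.sort(Z_opt[:, 2])))
```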

Understanding the Linkage Matrix

The linkage matrix can be inspected directly to understand how the algorithm proceeded:

# Inspect the first five merges
print("Linkage matrix shape:", Z.shape)
print("\nFirst 5 merges:")
print("Cluster A | Cluster B | Distance  | Size")
print("-" * 45)
for i in range(5):
    print(f"  {Z[i, 0]:7.0f} | {Z[i, 1]:9.0f} | {Z[i, 2]:9.4f} | {Z[i, 3]:4.0f}")

# Clusters with index < n (45) are original data points
# Clusters with index >= n are merged clusters formed at step (index - n)

Indices below the total number of samples refer to original data points. Indices at or above that value refer to intermediate clusters created during the merging process. For example, if your data has 45 observations, a cluster index of 47 would refer to the cluster formed at the third merge step (index 47 - 45 = step 2, zero-indexed).
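This index arithmetic can be turned into a small recursive helper that recovers the original observations inside any intermediate cluster. The function name cluster_members is ours; a sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
Z = linkage(X, method='ward')
n = X.shape[0]

def cluster_members(Z, n, idx):
    """Return the original observation indices inside cluster `idx`."""
    if idx < n:                                  # leaf: an original point
        return [idx]
    left, right = Z[idx - n, :2].astype(int)     # children of merge idx - n
    return cluster_members(Z, n, left) + cluster_members(Z, n, right)

# The cluster created by the first merge has index n and contains
# exactly the two points listed in row 0 of the linkage matrix
print(cluster_members(Z, n, n))

# The root cluster (index 2n - 2) contains every observation
print(sorted(cluster_members(Z, n, 2 * n - 2)))
```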

Cutting the Dendrogram into Flat Clusters

After computing the linkage matrix, you need to extract a set of flat cluster assignments. SciPy provides the fcluster() function for this purpose. There are three primary strategies for cutting the dendrogram:

Cutting by Distance Threshold

You can specify a maximum distance value. Any clusters that would need to merge at a distance above this threshold remain separate:

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at a distance threshold
max_distance = 8.0
labels_dist = fcluster(Z, t=max_distance, criterion='distance')

print(f"Number of clusters: {len(np.unique(labels_dist))}")
print(f"Cluster labels: {labels_dist}")

# Visualize the clustered data
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=labels_dist, cmap='viridis',
                      edgecolors='white', linewidth=0.5, s=60)
plt.colorbar(scatter, label='Cluster')
plt.title(f'Clusters from Distance Threshold = {max_distance}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.tight_layout()
plt.savefig('clusters_distance.png', dpi=150)
plt.show()

Cutting by Number of Clusters

Alternatively, you can request a specific number of clusters directly:

# Cut the dendrogram to produce exactly 3 clusters
labels_k = fcluster(Z, t=3, criterion='maxclust')

print(f"Number of clusters: {len(np.unique(labels_k))}")

# Visualize
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=labels_k, cmap='Set2',
                      edgecolors='white', linewidth=0.5, s=60)
plt.colorbar(scatter, label='Cluster')
plt.title('Clusters with maxclust = 3')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.tight_layout()
plt.savefig('clusters_maxclust.png', dpi=150)
plt.show()

Cutting by Inconsistency

The inconsistency criterion offers a more nuanced approach. It compares each merge distance to the average merge distance in its local neighborhood within the tree. A merge is considered "inconsistent" (and becomes a cluster boundary) if its distance is significantly higher than the average of the merges below it:

from scipy.cluster.hierarchy import inconsistent

# Compute the inconsistency statistics
depth = 2  # Number of levels to look back
incons = inconsistent(Z, d=depth)

# Cut using inconsistency: t is the inconsistency threshold
labels_incons = fcluster(Z, t=1.5, criterion='inconsistent', depth=depth)
print(f"Clusters from inconsistency criterion: {len(np.unique(labels_incons))}")

# Inspect inconsistency values for the last 5 merges
print("\nLast 5 inconsistency values:")
print("Mean dist | Std dist  | Count | Inconsistency")
print("-" * 50)
for i in range(-5, 0):
    print(f"  {incons[i, 0]:8.4f} | {incons[i, 1]:8.4f} | {incons[i, 2]:5.0f} | {incons[i, 3]:12.4f}")

The inconsistency coefficient for a merge is calculated as (merge distance - mean distance) / standard deviation, where the mean and standard deviation come from the merge distances in the subtree below. A high inconsistency value means the merge distance is unusually large compared to the merges that preceded it, which signals a natural cluster boundary. This approach adapts to the local structure of the dendrogram rather than applying a single global threshold.

Note

To choose a good distance threshold, look for the largest vertical gap in the dendrogram. The height where that gap occurs is a natural cutoff point. If the dendrogram shows one very tall vertical line followed by shorter ones, the data likely has a clear cluster structure at that level. The inconsistency criterion automates this intuition.
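The largest-gap heuristic can itself be automated in a few lines. A sketch on synthetic two-cluster data (the data and the midpoint-of-gap choice are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = np.vstack([rng.standard_normal((15, 2)),
               rng.standard_normal((15, 2)) + [6, 6]])

Z = linkage(X, method='ward')
heights = Z[:, 2]          # merge distances, non-decreasing for Ward

# Locate the largest jump between consecutive merge heights and cut
# in the middle of that gap
i = int(np.argmax(np.diff(heights)))
threshold = (heights[i] + heights[i + 1]) / 2

# All merges up to index i happen, leaving n - (i + 1) clusters
n = X.shape[0]
labels = fcluster(Z, t=threshold, criterion='distance')
print(f"threshold={threshold:.3f}, clusters={len(np.unique(labels))}")
```

On well-separated data like this, the largest gap sits just below the final merge, so the cut recovers the two generating clusters.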

Using scikit-learn AgglomerativeClustering

Scikit-learn provides AgglomerativeClustering, which wraps hierarchical clustering in the familiar scikit-learn estimator interface with fit() and fit_predict(). This is convenient when hierarchical clustering is part of a larger machine learning pipeline.

from sklearn.cluster import AgglomerativeClustering

# Cluster with 3 groups using Ward linkage
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_sklearn = model.fit_predict(X_scaled)

print(f"Cluster labels: {labels_sklearn}")
print(f"Number of leaves: {model.n_leaves_}")
print(f"Number of connected components: {model.n_connected_components_}")

# Visualize
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=labels_sklearn, cmap='Set1',
                      edgecolors='white', linewidth=0.5, s=60)
plt.colorbar(scatter, label='Cluster')
plt.title('AgglomerativeClustering (Ward, n_clusters=3)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.tight_layout()
plt.savefig('agglomerative_sklearn.png', dpi=150)
plt.show()

Generating a Dendrogram from scikit-learn

Scikit-learn does not have a built-in dendrogram function, but you can construct a linkage matrix from the fitted model's attributes and then use SciPy's dendrogram():

from scipy.cluster.hierarchy import dendrogram

def plot_sklearn_dendrogram(model, **kwargs):
    """Build a linkage matrix from a fitted AgglomerativeClustering model and plot its dendrogram."""
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)

    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    dendrogram(linkage_matrix, **kwargs)

# Refit with distance_threshold=0 to compute the full tree
model_full = AgglomerativeClustering(
    distance_threshold=0,
    n_clusters=None,
    linkage='ward'
)
model_full.fit(X_scaled)

plt.figure(figsize=(12, 6))
plot_sklearn_dendrogram(model_full, truncate_mode='level', p=3)
plt.title('Dendrogram from scikit-learn AgglomerativeClustering')
plt.xlabel('Number of points in node (or index if leaf)')
plt.ylabel('Distance')
plt.tight_layout()
plt.savefig('dendrogram_sklearn.png', dpi=150)
plt.show()

Pro Tip

Setting distance_threshold=0 and n_clusters=None forces scikit-learn to compute the full cluster tree and store all merge distances. This is required if you want to visualize the dendrogram afterward. Without these settings, the distances_ attribute is not populated.

Comparing Linkage Methods Visually

Different linkage methods can produce very different clustering results on the same data. The following example generates the same data and applies four linkage strategies side by side:

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
methods = ['ward', 'single', 'complete', 'average']

for ax, method in zip(axes.ravel(), methods):
    Z_method = linkage(X_scaled, method=method)
    labels_method = fcluster(Z_method, t=3, criterion='maxclust')

    ax.scatter(X[:, 0], X[:, 1], c=labels_method, cmap='Set2',
               edgecolors='white', linewidth=0.5, s=50)
    ax.set_title(f'{method.capitalize()} Linkage', fontweight='bold')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.suptitle('Comparing Linkage Methods (3 Clusters)', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.savefig('linkage_comparison.png', dpi=150)
plt.show()

With well-separated spherical clusters, all four methods will typically produce similar results. The differences become apparent with overlapping clusters, elongated shapes, or noisy data. Single linkage will connect nearby points in a chain, which can be either useful or problematic depending on the true cluster shape. Ward and complete linkage favor compact, rounded clusters. Average linkage provides a compromise between these extremes.

The scikit-learn documentation includes a useful visual comparison of how these methods behave on different toy datasets such as noisy circles, noisy moons, and anisotropic blobs (source: scikit-learn linkage comparison). That example demonstrates that Ward linkage is the strongest performer on noisy data, while average and complete linkage work well on cleanly separated globular clusters.

Validating Your Clusters

Producing clusters is easy. Knowing whether they are meaningful is the hard part. This is the section that separates a tutorial from a guide you can actually use in production. There are three distinct validation approaches, and each answers a different question.

Cophenetic Correlation: Does the Tree Faithfully Represent Your Data?

The cophenetic correlation coefficient measures how well the dendrogram preserves the original pairwise distances between data points. It compares the actual distance between every pair of points to the "cophenetic distance," which is the height in the dendrogram at which those two points first become part of the same cluster. A value close to 1.0 indicates that the hierarchical structure is a good representation of the distance matrix:

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Compute pairwise distances
original_distances = pdist(X_scaled)

# Compare cophenetic correlation for each linkage method
for method in ['ward', 'single', 'complete', 'average']:
    Z_m = linkage(X_scaled, method=method)
    coph_corr, coph_dists = cophenet(Z_m, original_distances)
    print(f"{method:10s} linkage  ->  cophenetic correlation: {coph_corr:.4f}")

Average linkage often yields the highest cophenetic correlation, meaning it tends to preserve the original distance structure well. However, a high cophenetic correlation does not always mean the clusters are the best for your specific task. It tells you whether the tree is a faithful representation of the data geometry, not whether the resulting clusters are useful for your problem.

Silhouette Score: Are the Clusters Well-Separated?

The silhouette coefficient measures, for each data point, how similar it is to other points in its own cluster compared to points in the nearest neighboring cluster. It ranges from -1 to 1: values near +1 mean the point is well-matched to its cluster and poorly matched to neighboring clusters, 0 means the point is on the boundary between two clusters, and negative values suggest the point may have been assigned to the wrong cluster. Peter Rousseeuw introduced this metric in his 1987 paper in the Journal of Computational and Applied Mathematics.

from sklearn.metrics import silhouette_score, silhouette_samples

# Test different numbers of clusters and evaluate
for n_clusters in range(2, 7):
    labels_test = fcluster(Z, t=n_clusters, criterion='maxclust')
    sil_score = silhouette_score(X_scaled, labels_test)
    print(f"  k={n_clusters}:  silhouette score = {sil_score:.4f}")

# Detailed per-sample silhouette analysis for the best k
best_k = 3
labels_best = fcluster(Z, t=best_k, criterion='maxclust')
sample_silhouettes = silhouette_samples(X_scaled, labels_best)

# Identify poorly clustered points (silhouette < 0)
problem_points = np.where(sample_silhouettes < 0)[0]
print(f"\nPoints with negative silhouette (possibly misassigned): {problem_points}")

The per-sample silhouette analysis is where the real diagnostic power lives. An overall silhouette score of 0.65 might look acceptable, but if you inspect the individual values, you might find that one cluster has consistently low silhouette scores while the others are strong. This indicates that the weak cluster may not represent a genuine grouping in the data, or that its boundaries need adjustment.
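That per-cluster breakdown takes only a few lines on top of silhouette_samples(). A sketch using the same kind of three-blob synthetic data as the earlier examples:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(42)
X = np.vstack([rng.standard_normal((15, 2)),
               rng.standard_normal((15, 2)) + [5, 5],
               rng.standard_normal((15, 2)) + [10, 0]])

Z = linkage(X, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')
sil = silhouette_samples(X, labels)

# Mean and worst silhouette per cluster: a weak cluster stands out
for k in np.unique(labels):
    members = sil[labels == k]
    print(f"cluster {k}: mean={members.mean():.3f}, "
          f"min={members.min():.3f}, n={len(members)}")
```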

Davies-Bouldin Index: A Second Opinion

The Davies-Bouldin index evaluates cluster quality by comparing each cluster's internal scatter to the distance between cluster centers. Lower values indicate better separation. It provides a different perspective than the silhouette score because it focuses on cluster-to-cluster relationships rather than point-to-cluster relationships:

from sklearn.metrics import davies_bouldin_score

for n_clusters in range(2, 7):
    labels_test = fcluster(Z, t=n_clusters, criterion='maxclust')
    db_score = davies_bouldin_score(X_scaled, labels_test)
    print(f"  k={n_clusters}:  Davies-Bouldin index = {db_score:.4f}")

# Lower Davies-Bouldin values indicate better-defined clusters

Using multiple validation metrics together gives you a more robust picture. If both the silhouette score and the Davies-Bouldin index agree on the optimal number of clusters, you can have greater confidence in that choice. When they disagree, it often signals that the data has ambiguous structure at that level of granularity, and you should examine the dendrogram more carefully to understand why.

Note

All internal validation metrics (silhouette, Davies-Bouldin, cophenetic correlation) measure geometric properties of the clusters. They cannot tell you whether the clusters are useful for your domain problem. A biologically meaningful grouping of gene expression profiles might have a mediocre silhouette score because biology does not always produce clean geometric separations. Always combine quantitative metrics with domain expertise.

Handling Outliers Before and After Clustering

Hierarchical clustering has no built-in concept of noise or outliers. Every data point must belong to some cluster, which means outliers can distort cluster boundaries and inflate merge distances. Complete linkage is particularly sensitive to this: a single extreme point can prevent two otherwise natural clusters from merging because the maximum inter-cluster distance is dominated by the outlier.

There are two complementary strategies for handling this. Before clustering, you can detect and remove (or flag) outliers using methods that are independent of the clustering itself:

from sklearn.ensemble import IsolationForest

# Detect outliers before clustering
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_labels = iso_forest.fit_predict(X_scaled)

# Separate inliers and outliers
inliers_mask = outlier_labels == 1
X_clean = X_scaled[inliers_mask]
X_outliers = X_scaled[~inliers_mask]

print(f"Inliers: {inliers_mask.sum()}, Outliers: {(~inliers_mask).sum()}")

# Cluster only the clean data
Z_clean = linkage(X_clean, method='ward')
labels_clean = fcluster(Z_clean, t=3, criterion='maxclust')

After clustering, you can use the silhouette scores from the validation step to identify points that may be acting as outliers within the clustered result. Points with strongly negative silhouette scores are candidates for removal or reclassification. You can also assign outliers to existing clusters after the fact by finding the nearest cluster centroid, which avoids letting them influence the cluster structure during the merge process.
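The nearest-centroid assignment described above can be sketched as follows (the data, including the hand-placed outliers, is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X_clean = np.vstack([rng.standard_normal((20, 2)),
                     rng.standard_normal((20, 2)) + [6, 0]])
X_outliers = np.array([[30.0, 30.0], [7.0, 1.0]])  # hand-placed outliers

# Cluster only the clean points
Z = linkage(X_clean, method='ward')
labels_clean = fcluster(Z, t=2, criterion='maxclust')

# Assign each held-out point to the cluster with the nearest centroid,
# so it receives a label without influencing the merge process
centroids = np.array([X_clean[labels_clean == k].mean(axis=0)
                      for k in np.unique(labels_clean)])
outlier_labels = cdist(X_outliers, centroids).argmin(axis=1) + 1  # 1-indexed

print("outlier assignments:", outlier_labels)
```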

If outlier handling is a primary concern for your application, consider whether HDBSCAN (discussed in the scaling section below) might be a better fit for your problem. HDBSCAN explicitly models noise points and does not force every observation into a cluster.

Scaling Beyond 10,000 Points

All hierarchical clustering algorithms in SciPy require O(n²) memory to store the pairwise distance matrix. For 10,000 observations, this matrix has 50 million entries. At 20,000 observations, it has 200 million entries. By 50,000 observations, the memory requirements alone exceed what is practical on many machines, and the computation time becomes prohibitive.

This is not a limitation you can algorithm your way around within the standard agglomerative framework. But there are several concrete strategies for working with larger datasets.

fastcluster: A Drop-In Replacement

The fastcluster library, published by Daniel Mullner in the Journal of Statistical Software (2013), provides a C++ implementation of the same linkage algorithms with the same API as SciPy but significantly faster constant factors. It serves as a direct drop-in replacement:

import fastcluster

# Drop-in replacement for scipy.cluster.hierarchy.linkage
Z_fast = fastcluster.linkage(X_scaled, method='ward')

# The output is identical in format to SciPy's linkage matrix
# Use SciPy's dendrogram() and fcluster() as before
labels_fast = fcluster(Z_fast, t=3, criterion='maxclust')

# For vector data, linkage_vector is even more memory-efficient
Z_fast_vec = fastcluster.linkage_vector(X_scaled, method='ward')

# preserve_input=False halves memory for a precomputed condensed
# distance matrix: fastcluster overwrites the input array in place
from scipy.spatial.distance import pdist
D = pdist(X_scaled)
Z_fast_mem = fastcluster.linkage(D, method='ward', preserve_input=False)
# D is now invalid and must not be reused

The linkage_vector function is particularly valuable because it operates directly on the observation vectors rather than requiring a precomputed distance matrix, which can significantly reduce memory usage. Mullner's benchmarks demonstrate that fastcluster achieves the theoretical optimal O(n²) time complexity for all linkage methods, whereas SciPy's implementation of centroid and median linkage runs in O(n³) (source: fastcluster project page).

HDBSCAN: When You Need Density-Aware Hierarchical Clustering at Scale

If your dataset exceeds 50,000 points, or if you need a hierarchical method that handles variable-density clusters and noise, HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is the strongest alternative. Unlike agglomerative clustering, HDBSCAN does not require you to specify the number of clusters and automatically determines cluster boundaries based on density stability:

from sklearn.cluster import HDBSCAN

# HDBSCAN is included in scikit-learn since version 1.3
clusterer = HDBSCAN(min_cluster_size=15, min_samples=5)
hdbscan_labels = clusterer.fit_predict(X_scaled)

# Points labeled -1 are noise (outliers)
n_clusters = len(set(hdbscan_labels)) - (1 if -1 in hdbscan_labels else 0)
n_noise = (hdbscan_labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")

# Cluster probabilities indicate confidence of assignment
print(f"Assignment probabilities: {clusterer.probabilities_[:10]}")

HDBSCAN's key advantage is that it identifies clusters of varying density without requiring a fixed distance threshold, and it explicitly labels noise points rather than forcing them into clusters. The hdbscan library's performance benchmarking shows that HDBSCAN scales comparably to DBSCAN and K-Means for datasets up to 200,000+ points, while sklearn's AgglomerativeClustering becomes impractical beyond roughly 10,000-20,000 points (source: HDBSCAN benchmarks).

Subsampling and Approximate Strategies

For datasets where you still want agglomerative clustering specifically but the full dataset is too large, a practical workflow is to cluster a representative subsample first, then assign the remaining points to existing clusters:

# Strategy: Cluster a subsample, then assign remaining points
from sklearn.utils import resample
from scipy.spatial.distance import cdist

n_subsample = 5000  # Manageable size for hierarchical clustering

# Subsample without replacement (stratify if you have prior labels)
X_sub = resample(X_scaled, n_samples=n_subsample, replace=False,
                 random_state=42)

# Cluster the subsample
Z_sub = linkage(X_sub, method='ward')
labels_sub = fcluster(Z_sub, t=3, criterion='maxclust')

# Compute cluster centroids from the subsample
centroids = np.array([X_sub[labels_sub == k].mean(axis=0)
                       for k in np.unique(labels_sub)])

# Assign all remaining points to the nearest centroid
distances_to_centroids = cdist(X_scaled, centroids)
labels_all = distances_to_centroids.argmin(axis=1) + 1  # 1-indexed like fcluster

This approach sacrifices the hierarchical structure for the full dataset but preserves it for the subsample, which you can still inspect via the dendrogram. For many practical applications, this is an acceptable tradeoff.

Real-World Applications and Domain Patterns

Understanding when to reach for hierarchical clustering instead of K-Means or DBSCAN requires recognizing the problem patterns where a tree structure provides genuine value rather than just being a different way to get flat clusters.

Genomics and phylogenetics. Hierarchical clustering is foundational in bioinformatics. Gene expression data from RNA-seq experiments is routinely clustered hierarchically to discover co-expressed gene modules and to identify patient subgroups. The dendrogram itself is the primary output because it reveals evolutionary or functional relationships between genes. Ward linkage with Euclidean distance is the default for expression matrices; average linkage with correlation-based distance is common when the magnitude of expression matters less than the expression pattern across conditions.
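
A minimal sketch of the correlation-distance pattern described above, using toy random data in place of a real expression matrix (the names expr and modules are illustrative):

```python
# Sketch: average linkage on correlation distance, the common choice when
# expression pattern matters more than magnitude. `expr` is toy data
# standing in for a real genes-by-conditions expression matrix.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 8))  # 50 "genes" across 8 conditions

# 'correlation' distance = 1 - Pearson correlation between rows
corr_dist = pdist(expr, metric='correlation')
Z = linkage(corr_dist, method='average')  # average linkage is valid for non-Euclidean metrics
modules = fcluster(Z, t=4, criterion='maxclust')
print(f"Module labels (first 10): {modules[:10]}")
```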

Document and text organization. When building document taxonomies or organizing a corpus, the hierarchical structure maps naturally to a browsable category tree. A flat partition would force you to choose a single level of granularity (is "machine learning" one topic or should it be split into "supervised learning" and "unsupervised learning"?). A dendrogram gives you both simultaneously.

Customer segmentation at multiple granularities. Marketing teams often need to segment customers at different levels of detail: broad segments for brand strategy, finer segments for campaign targeting, and micro-segments for personalization. A single hierarchical clustering run produces all three levels, and the dendrogram shows exactly how the finer segments nest within the broader ones.
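
This multi-granularity workflow can be sketched as follows — the random matrix stands in for real standardized customer features, and the three cut levels are illustrative choices:

```python
# Sketch: cut one Ward tree at three granularities. X is synthetic,
# standing in for a standardized customer-feature matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))

Z = linkage(X, method='ward')  # computed once, cut three times

broad = fcluster(Z, t=3, criterion='maxclust')   # brand strategy
fine  = fcluster(Z, t=8, criterion='maxclust')   # campaign targeting
micro = fcluster(Z, t=20, criterion='maxclust')  # personalization

# The cuts are nested: every fine segment lies inside exactly one broad segment
for k in np.unique(fine):
    assert len(np.unique(broad[fine == k])) == 1
```

Because all three partitions are horizontal cuts of the same monotone tree, the nesting property holds by construction — something three separate K-Means runs cannot guarantee.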

Image segmentation with spatial constraints. Scikit-learn's AgglomerativeClustering supports a connectivity parameter that restricts which clusters can be merged. By passing a connectivity matrix derived from spatial adjacency (using sklearn.neighbors.kneighbors_graph), you can enforce that only spatially neighboring regions merge, which is exactly what image segmentation requires:


from sklearn.neighbors import kneighbors_graph

# Build a connectivity matrix from 5 nearest neighbors in feature space;
# for actual image data, derive it from pixel adjacency instead
# (e.g. sklearn.feature_extraction.image.grid_to_graph)
connectivity = kneighbors_graph(X_scaled, n_neighbors=5, include_self=False)

# Cluster with connectivity constraints
model_conn = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',
    connectivity=connectivity
)
labels_conn = model_conn.fit_predict(X_scaled)

print(f"Cluster labels with connectivity: {labels_conn}")

Anomaly detection through dendrogram structure. Points that merge into the tree very late (at high distances) are structurally unusual in the data. You can identify them by examining which observations appear as singleton branches that join the rest of the tree only at the highest merge levels. This provides a form of anomaly detection that is complementary to dedicated outlier methods like Isolation Forest.
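
One way to sketch this idea: record the merge height at which each original observation first joins the tree, and flag observations whose first merge is unusually high. The two planted outliers and the 5-standard-deviation cutoff below are illustrative assumptions:

```python
# Sketch: flag points whose first merge into the tree happens unusually late.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (98, 2)),
               [[8.0, 8.0], [-9.0, 7.0]]])  # two planted outliers
n = len(X)
Z = linkage(X, method='ward')

# Each original index appears exactly once as a child in Z;
# record the height of the merge where it first joins another cluster
first_merge = np.full(n, np.nan)
for a, b, height, _ in Z:
    for idx in (int(a), int(b)):
        if idx < n and np.isnan(first_merge[idx]):
            first_merge[idx] = height

# Points merging far later than typical are structurally unusual
# (the 5-sigma cutoff is an illustrative choice, not a standard)
threshold = np.nanmedian(first_merge) + 5 * np.nanstd(first_merge)
outliers = np.where(first_merge > threshold)[0]
print(f"Late-merging points: {outliers}")
```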

Practical Considerations and Performance

Time and Memory Complexity

Hierarchical clustering requires computing and storing a pairwise distance matrix, which means memory usage scales as O(n²). For the optimized algorithms in SciPy (single, complete, average, weighted, and Ward linkage), time complexity is O(n²). The centroid and median methods use a naive algorithm with O(n³) time. For datasets larger than around 10,000 to 20,000 observations, standard hierarchical clustering may become impractical. Consider fastcluster for moderate speedups with the same algorithm, HDBSCAN for density-aware hierarchical clustering at scale, or BIRCH (sklearn.cluster.Birch) for an incremental approach that processes data in a single pass and can handle datasets that do not fit in memory.
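
As a rough sketch of the BIRCH option mentioned above, which processes data incrementally via partial_fit — the random chunks here stand in for batches read from disk or a database:

```python
# Sketch: BIRCH processes data in chunks, so the full dataset never
# needs to be in memory at once. Random chunks stand in for real batches.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(7)
birch = Birch(n_clusters=3, threshold=0.5)

for _ in range(5):                      # stream five batches of 1,000 rows
    chunk = rng.normal(size=(1000, 4))
    birch.partial_fit(chunk)

# New points are assigned via the compact CF-tree summary
labels = birch.predict(rng.normal(size=(10, 4)))
print(f"Labels for new points: {labels}")
```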

When to Use Hierarchical Clustering

Hierarchical clustering is a strong choice when you need to explore your data at multiple levels of granularity, when you do not know the number of clusters in advance, when the dendrogram itself is a valuable output (for example, in phylogenetics, taxonomy, or document organization), or when your dataset is small enough that the O(n²) cost is acceptable. It is a poor choice when you have more than 20,000 observations and need to cluster the full dataset (unless you use fastcluster or a subsampling approach), when your clusters have dramatically varying densities (use HDBSCAN instead), or when you need the algorithm to handle noise explicitly.

Warning

Ward, centroid, and median linkage methods are only valid with Euclidean distance. If you pass precomputed distances from a non-Euclidean metric and use one of these methods, the results will be mathematically incorrect. Use average or complete linkage instead when working with non-Euclidean metrics such as cosine similarity, Manhattan distance, or correlation-based distances.
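
A short sketch of the correct pairing — a non-Euclidean metric passed as precomputed condensed distances to average linkage:

```python
# Sketch: cosine distance paired with average linkage, the valid combination.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.random((30, 10))

cos_dist = pdist(X, metric='cosine')     # condensed non-Euclidean distances
Z = linkage(cos_dist, method='average')  # valid pairing
# linkage(cos_dist, method='ward') would run without error but produce
# mathematically meaningless merge heights, since Ward assumes Euclidean distance.
labels = fcluster(Z, t=3, criterion='maxclust')
print(f"Cluster labels: {labels}")
```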

Key Takeaways

  1. Preprocess first, cluster second. Standardize your features before computing distances. Unscaled features let large-magnitude variables dominate Euclidean-based linkage methods and are the most common source of poor clustering outcomes.
  2. The dendrogram is the complete result. Every possible flat partition at every possible number of clusters is encoded in the dendrogram. This is the fundamental advantage over flat methods like K-Means, which require training a separate model for each k.
  3. Linkage choice matters. Ward linkage is a reliable default for compact clusters with Euclidean distance. Average linkage is versatile across different distance metrics and preserves distance structure well. Single linkage detects elongated shapes but is prone to chaining. Avoid centroid and median linkage unless you specifically need them, because of inversion issues and O(n³) complexity.
  4. Two Python interfaces serve different purposes. Use scipy.cluster.hierarchy when you need dendrograms, cophenetic analysis, inconsistency-based cutting, or fine-grained control. Use sklearn.cluster.AgglomerativeClustering when you want integration with scikit-learn pipelines or connectivity constraints.
  5. Validate with multiple metrics. Cophenetic correlation tells you whether the dendrogram faithfully represents your data. Silhouette score tells you whether the clusters are well-separated. Davies-Bouldin index provides a second geometric perspective. None of these tell you whether the clusters are useful for your domain problem.
  6. Plan for scale. Standard agglomerative clustering works well up to 10,000-20,000 samples. Beyond that, use fastcluster as a drop-in speedup, HDBSCAN for density-aware hierarchical clustering, or a subsample-then-assign strategy.
  7. Handle outliers explicitly. Hierarchical clustering has no concept of noise. Either remove outliers before clustering (Isolation Forest, z-score filtering) or use HDBSCAN, which models noise directly.

Hierarchical clustering remains one of the most interpretable and flexible clustering methods available. Its ability to produce a full tree of groupings, combined with Python's library support in SciPy, scikit-learn, and fastcluster, makes it a practical choice for exploratory analysis, small-to-medium datasets, and any application where understanding the relationships between clusters is as important as the cluster assignments themselves. The key is to pair the algorithm with proper preprocessing, rigorous validation, and an honest assessment of whether your data's scale and structure are a good fit for the method.
