When you are working with high-dimensional datasets, trying to visualize them directly is essentially impossible. t-Distributed Stochastic Neighbor Embedding, known as t-SNE, is one of the most widely used techniques for projecting complex, multi-dimensional data into two or three dimensions where patterns, clusters, and relationships become visible to the human eye. This article walks through how t-SNE works, how to implement it in Python using scikit-learn, and the practical considerations that separate a useful visualization from a misleading one.
Dimensionality reduction is a fundamental challenge in data science. Datasets frequently contain dozens or even hundreds of features, and while machine learning algorithms can operate in these high-dimensional spaces, humans cannot perceive more than three spatial dimensions. t-SNE was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 and has since become a standard tool for exploratory data analysis, particularly in fields like genomics, natural language processing, and computer vision where datasets routinely contain thousands of dimensions.
What Is t-SNE and Why Does It Matter
t-SNE is a nonlinear dimensionality reduction technique designed specifically for visualization. Unlike linear methods such as PCA (Principal Component Analysis), t-SNE excels at preserving the local structure of data. Points that are close together in the original high-dimensional space tend to remain close together in the reduced two-dimensional or three-dimensional embedding.
This makes t-SNE especially powerful for revealing clusters and groupings that would be invisible in a PCA projection. If your data contains natural clusters defined by complex, nonlinear boundaries, t-SNE will often expose them far more clearly than any linear technique could.
t-SNE is a visualization tool, not a general-purpose dimensionality reduction method. The reduced dimensions do not carry interpretable meaning the way PCA components do, so t-SNE output should not be used as input features for downstream classifiers or regression models.
How t-SNE Works Under the Hood
The algorithm operates in two main stages. In the first stage, t-SNE constructs a probability distribution over pairs of points in the original high-dimensional space. For each pair of data points, it calculates the conditional probability that one point would pick the other as its neighbor, assuming neighbors are selected based on a Gaussian distribution centered at each point. Points that are close together receive a high probability, and points that are far apart receive a near-zero probability.
In the second stage, t-SNE defines a similar probability distribution over the points in the low-dimensional map. However, instead of using a Gaussian distribution, it uses a Student's t-distribution with one degree of freedom (also known as a Cauchy distribution). The heavier tails of the t-distribution solve what is known as the "crowding problem" -- in low-dimensional space, there is less room to accommodate moderate-distance neighbors, and the t-distribution gives these points more breathing room.
The algorithm then minimizes the Kullback-Leibler (KL) divergence between the two probability distributions using gradient descent. The KL divergence measures how different the low-dimensional representation is from the high-dimensional original. When this divergence is minimized, the low-dimensional map faithfully reflects the neighbor relationships in the original data.
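To make the two stages concrete, here is a toy NumPy sketch. This is not the real algorithm: it uses a single fixed Gaussian bandwidth instead of the per-point bandwidth search that matches a target perplexity, and it only evaluates the KL divergence for one candidate embedding rather than optimizing it with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X_high = rng.normal(size=(6, 10))   # 6 points in 10-D
Y_low = rng.normal(size=(6, 2))     # one candidate 2-D embedding

def pairwise_sq_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(-1)

# Stage 1: high-dimensional affinities from a Gaussian kernel
# (fixed bandwidth here), symmetrized and normalized to sum to 1
D = pairwise_sq_dists(X_high)
P = np.exp(-D / 2.0)
np.fill_diagonal(P, 0.0)
P = P + P.T
P /= P.sum()

# Stage 2: low-dimensional affinities from a Student's t-distribution
# with one degree of freedom (heavy tails ease the crowding problem)
Dy = pairwise_sq_dists(Y_low)
Q = 1.0 / (1.0 + Dy)
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# The cost t-SNE minimizes: KL divergence between the two distributions
kl = np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))
print(f"KL(P || Q) = {kl:.4f}")
```

In the actual algorithm, the positions in Y_low are the variables being optimized: gradient descent nudges them until Q matches P as closely as possible.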
Basic t-SNE with scikit-learn
scikit-learn provides an implementation of t-SNE through the sklearn.manifold.TSNE class. The API follows the standard scikit-learn pattern, making it straightforward to apply t-SNE to any array-like dataset.
import numpy as np
from sklearn.manifold import TSNE
# Create sample high-dimensional data
# 200 samples with 50 features each
np.random.seed(42)
X = np.random.randn(200, 50)
# Apply t-SNE to reduce to 2 dimensions
tsne = TSNE(
    n_components=2,
    random_state=42,
    learning_rate='auto',
    init='pca'
)
X_embedded = tsne.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Embedded shape: {X_embedded.shape}")
print(f"KL divergence: {tsne.kl_divergence_:.4f}")
The fit_transform() method takes the high-dimensional data and returns the low-dimensional embedding in a single step. The kl_divergence_ attribute stores the final value of the cost function, which tells you how well the embedding preserves the original neighbor relationships. Lower values indicate a better fit.
As of scikit-learn 1.2, the default learning_rate is 'auto', which sets the rate to max(N / early_exaggeration / 4, 50), where N is the sample size. This automatic setting follows research showing that scaling the learning rate with sample size significantly improves results across a wide range of dataset sizes, so you rarely need to tune this parameter manually.
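The formula is easy to verify directly. A small sketch, assuming scikit-learn's default early_exaggeration of 12.0:

```python
# Reproduce the learning_rate='auto' formula: max(N / early_exaggeration / 4, 50).
# The floor of 50 kicks in for small datasets.
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    return max(n_samples / early_exaggeration / 4, 50)

for n in (200, 1_000, 10_000, 100_000):
    print(f"N={n:>7}: learning_rate={auto_learning_rate(n):.1f}")
```

For the 200-sample example above, the floor applies and the rate is 50; only past roughly 2,400 samples does the N-dependent term take over.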
Understanding Perplexity
Perplexity is the single most important hyperparameter in t-SNE, and understanding what it does is critical to producing meaningful visualizations. Perplexity can be thought of as a rough estimate of how many close neighbors each point has. It controls the balance between preserving local detail and maintaining a sense of the broader data structure.
The typical range for perplexity is between 5 and 50. Lower perplexity values focus heavily on very local relationships, which can cause the algorithm to fragment real clusters into smaller pieces. Higher perplexity values consider more neighbors, which can reveal larger-scale structure but may also blur the boundaries between distinct groups.
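Formally, perplexity is defined as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution, which is why it reads as an "effective number of neighbors". A small illustrative sketch (the perplexity helper below is our own, not part of scikit-learn):

```python
import numpy as np

# Perplexity of a probability distribution: 2 ** H(p), with H in bits.
# A uniform distribution over k neighbors has perplexity exactly k.
def perplexity(p):
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy

uniform_5 = np.full(5, 0.2)          # 5 equally likely neighbors
print(perplexity(uniform_5))         # close to 5.0

skewed = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(perplexity(skewed))            # fewer "effective" neighbors than 5
```

Internally, t-SNE tunes each point's Gaussian bandwidth so that its neighbor distribution hits the perplexity you request.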
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import make_blobs
# Generate synthetic clustered data
X, y = make_blobs(
    n_samples=500,
    n_features=20,
    centers=5,
    random_state=42
)
# Try different perplexity values
perplexities = [5, 15, 30, 50]
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for ax, perplexity in zip(axes, perplexities):
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        random_state=42,
        learning_rate='auto',
        init='pca'
    )
    X_embedded = tsne.fit_transform(X)
    ax.scatter(
        X_embedded[:, 0],
        X_embedded[:, 1],
        c=y,
        cmap='tab10',
        s=15,
        alpha=0.7
    )
    ax.set_title(f'Perplexity = {perplexity}')
    ax.set_xticks([])
    ax.set_yticks([])
plt.suptitle('Effect of Perplexity on t-SNE Output', fontsize=14)
plt.tight_layout()
plt.savefig('tsne_perplexity_comparison.png', dpi=150)
plt.show()
Running this code with different perplexity values demonstrates how dramatically this single parameter affects the output. With a perplexity of 5, you will often see tight, fragmented clusters. At 30 (the default), the clusters are typically well-formed and clearly separated. At 50, the structure may start to blur slightly, depending on the dataset. There is no universally correct perplexity value -- it depends entirely on the characteristics of your data.
Preprocessing with PCA Before t-SNE
When working with high-dimensional data that has many hundreds or thousands of features, running t-SNE directly on the raw data is computationally expensive and can produce noisy results. The scikit-learn documentation explicitly recommends reducing the number of dimensions first using PCA or TruncatedSVD, typically down to around 50 components.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Assume X_high_dim has shape (n_samples, 784)
# Step 1: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_high_dim)
# Step 2: Reduce to 50 dimensions with PCA
pca = PCA(n_components=50, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
# Step 3: Apply t-SNE on the PCA-reduced data
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    learning_rate='auto',
    init='pca',
    max_iter=1000
)
X_tsne = tsne.fit_transform(X_pca)
print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")
This two-stage approach offers several advantages. PCA removes noise from the data by discarding components with very low variance, which helps t-SNE focus on the meaningful structure. It also dramatically reduces computation time because t-SNE has to compute pairwise distances between all samples, and fewer features means faster distance calculations.
The init='pca' parameter initializes the low-dimensional embedding using the first two PCA components rather than random positions. This generally produces more reproducible results and helps the gradient descent converge to a better solution.
Visualizing the Iris Dataset
The Iris dataset is a classic benchmark with 150 samples and 4 features across 3 species of iris flowers. Even though this dataset is small and low-dimensional, it provides a clear demonstration of how t-SNE separates known classes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Apply t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    learning_rate='auto',
    init='pca'
)
X_tsne = tsne.fit_transform(X)
# Apply PCA for comparison
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Create side-by-side comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#306998', '#FFD43B', '#98c379']
for i, name in enumerate(target_names):
    mask = y == i
    ax1.scatter(
        X_pca[mask, 0], X_pca[mask, 1],
        c=colors[i], label=name, s=40, alpha=0.7
    )
    ax2.scatter(
        X_tsne[mask, 0], X_tsne[mask, 1],
        c=colors[i], label=name, s=40, alpha=0.7
    )
ax1.set_title('PCA Projection')
ax1.legend()
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')
ax2.set_title('t-SNE Projection')
ax2.legend()
ax2.set_xlabel('t-SNE 1')
ax2.set_ylabel('t-SNE 2')
plt.suptitle('Iris Dataset: PCA vs t-SNE', fontsize=14)
plt.tight_layout()
plt.savefig('iris_pca_vs_tsne.png', dpi=150)
plt.show()
In the PCA projection, setosa separates cleanly from the other two species, but versicolor and virginica overlap considerably. The t-SNE projection, by contrast, typically produces three well-defined clusters with clearer boundaries between versicolor and virginica. This demonstrates t-SNE's ability to reveal structure that linear methods cannot.
Working with Larger Datasets
t-SNE's computational complexity is a significant practical concern. The default Barnes-Hut approximation in scikit-learn runs in O(N log N) time, which is a massive improvement over the exact algorithm's O(N^2) complexity. However, for datasets with tens of thousands of samples, even the Barnes-Hut method can take considerable time.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import time
# For large datasets, use this optimized pipeline
def tsne_large_dataset(X, n_pca_components=50, perplexity=30):
    """
    Apply t-SNE to large datasets with PCA preprocessing.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input data.
    n_pca_components : int
        Number of PCA components to retain.
    perplexity : float
        t-SNE perplexity parameter.

    Returns
    -------
    X_embedded : ndarray of shape (n_samples, 2)
        The 2D embedding.
    """
    start = time.time()
    # PCA reduction first
    if X.shape[1] > n_pca_components:
        pca = PCA(n_components=n_pca_components, random_state=42)
        X_reduced = pca.fit_transform(X)
        variance = pca.explained_variance_ratio_.sum()
        print(f"PCA: {X.shape[1]} -> {n_pca_components} dims "
              f"({variance:.1%} variance retained)")
    else:
        X_reduced = X
    # t-SNE with Barnes-Hut approximation
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate='auto',
        init='pca',
        method='barnes_hut',
        angle=0.5,
        n_jobs=-1,
        random_state=42,
        max_iter=1000
    )
    X_embedded = tsne.fit_transform(X_reduced)
    elapsed = time.time() - start
    print(f"t-SNE completed in {elapsed:.1f}s "
          f"(KL divergence: {tsne.kl_divergence_:.4f})")
    return X_embedded
The method='barnes_hut' parameter is the default and the recommended choice for datasets larger than a few hundred samples. The angle parameter (sometimes called theta) controls the trade-off between speed and accuracy in the Barnes-Hut approximation. Values between 0.2 and 0.8 work well, with the default of 0.5 offering a good balance. Setting n_jobs=-1 uses all available CPU cores for the nearest-neighbor search portion of the algorithm.
t-SNE does not have a transform() method in scikit-learn. You cannot fit on a training set and then transform new data points. Every call to fit_transform() computes an entirely new embedding. If you need to embed new points into an existing visualization, consider using the openTSNE library, which supports adding points to an existing embedding.
Common Pitfalls and How to Avoid Them
t-SNE is powerful but also easy to misinterpret. Understanding these common mistakes will help you avoid drawing false conclusions from your visualizations.
Cluster Sizes Are Not Meaningful
The relative sizes of clusters in a t-SNE plot do not reflect the density or spread of the original data. t-SNE tends to expand dense clusters and compress sparse ones, so two clusters that appear the same size in the visualization might have very different densities in the original space. Never draw conclusions about relative cluster sizes from a t-SNE plot.
Distances Between Clusters Are Not Reliable
The distance between two clusters in a t-SNE visualization does not reliably indicate how far apart those clusters are in the original data. Two clusters that appear close together might be quite distant in high-dimensional space, and vice versa. t-SNE preserves local neighborhoods, not global distances.
Always Run Multiple Times
Because t-SNE's cost function is non-convex, different random initializations can produce visually different embeddings. Always set a random_state for reproducibility, and consider running the algorithm multiple times with different seeds to verify that the patterns you observe are consistent rather than artifacts of a particular initialization.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
# Load a more complex dataset
digits = load_digits()
X, y = digits.data, digits.target
# Run t-SNE with different random seeds
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
seeds = [42, 123, 7]
for ax, seed in zip(axes, seeds):
    tsne = TSNE(
        n_components=2,
        perplexity=30,
        random_state=seed,
        learning_rate='auto',
        init='pca'
    )
    X_embedded = tsne.fit_transform(X)
    scatter = ax.scatter(
        X_embedded[:, 0],
        X_embedded[:, 1],
        c=y,
        cmap='tab10',
        s=8,
        alpha=0.6
    )
    ax.set_title(f'random_state={seed}')
    ax.set_xticks([])
    ax.set_yticks([])
plt.suptitle('t-SNE Stability Across Random Seeds', fontsize=14)
plt.tight_layout()
plt.savefig('tsne_stability.png', dpi=150)
plt.show()
With init='pca', the results tend to be more stable across runs than with random initialization. However, the overall arrangement and rotation of clusters may still differ even when the local structure is preserved consistently.
t-SNE vs PCA: When to Use Each
PCA and t-SNE serve different purposes and have complementary strengths. PCA is a linear technique that finds the directions of maximum variance. It is fast, deterministic, and produces interpretable components where each dimension corresponds to a linear combination of the original features. PCA preserves global structure and is well-suited for feature extraction, noise reduction, and as a preprocessing step for other algorithms.
t-SNE is a nonlinear technique optimized for visualization. It excels at preserving local neighborhoods and revealing clusters that are defined by complex, nonlinear relationships. However, it is slower, non-deterministic (without a fixed random seed), and the output dimensions have no interpretable meaning.
In practice, many workflows use both techniques together. PCA serves as a preprocessing step to reduce noise and computational cost, and then t-SNE produces the final visualization from the PCA-reduced representation. This combination leverages the strengths of both methods.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()
X, y = digits.data, digits.target
# PCA alone
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# PCA + t-SNE
pca_50 = PCA(n_components=50, random_state=42)
X_pca_50 = pca_50.fit_transform(X)
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    learning_rate='auto',
    init='pca'
)
X_tsne = tsne.fit_transform(X_pca_50)
# Visualize both
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))
scatter1 = ax1.scatter(
    X_pca[:, 0], X_pca[:, 1],
    c=y, cmap='tab10', s=8, alpha=0.6
)
ax1.set_title('PCA (2 components)')
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')
plt.colorbar(scatter1, ax=ax1, label='Digit')
scatter2 = ax2.scatter(
    X_tsne[:, 0], X_tsne[:, 1],
    c=y, cmap='tab10', s=8, alpha=0.6
)
ax2.set_title('PCA (50) + t-SNE (2)')
ax2.set_xlabel('t-SNE 1')
ax2.set_ylabel('t-SNE 2')
plt.colorbar(scatter2, ax=ax2, label='Digit')
plt.suptitle('Handwritten Digits: PCA vs t-SNE', fontsize=14)
plt.tight_layout()
plt.savefig('digits_pca_vs_tsne.png', dpi=150)
plt.show()
On the digits dataset, the difference is striking. PCA produces a projection where several digit classes overlap significantly, making it hard to distinguish between them visually. The t-SNE projection, however, separates the ten digit classes into well-defined, clearly visible clusters. This is exactly the kind of exploratory insight that makes t-SNE such a valuable tool.
Key Takeaways
- t-SNE is a visualization technique: Use it to explore and understand high-dimensional data visually. It preserves local neighborhood structure and excels at revealing clusters, but the axes of the output have no interpretable meaning.
- Perplexity matters: This parameter controls how many neighbors each point considers. Values between 5 and 50 are typical, and experimenting with different values for your specific dataset is essential. There is no single correct setting.
- Always preprocess with PCA: For datasets with many features, reducing to 50 dimensions with PCA before applying t-SNE removes noise, speeds up computation, and often produces cleaner results.
- Do not over-interpret the output: Cluster sizes and inter-cluster distances in a t-SNE plot do not reliably correspond to the original data structure. Focus on whether clusters form and separate, not on their specific visual properties.
- Use init='pca' and learning_rate='auto': These settings, which are the defaults in current versions of scikit-learn, produce more stable and reproducible results than the older defaults of random initialization and a fixed learning rate.
- Consider alternatives for embedding new data: scikit-learn's TSNE cannot transform unseen data points. If you need this capability, look into the openTSNE library, which supports adding new points to an existing embedding.
t-SNE remains one of the best tools available for making sense of complex, high-dimensional datasets. By understanding how the algorithm works, choosing appropriate parameters, and avoiding common misinterpretations, you can use it to uncover meaningful patterns that would otherwise remain hidden in the numbers. Start with the basic examples shown here, experiment with perplexity on your own data, and always remember that t-SNE is a lens for exploration -- not a source of definitive conclusions about your data's structure.