Large datasets are full of connections that standard analysis can miss entirely. Two variables might share a complex, non-linear dependency that Pearson correlation scores as zero. Python gives you a full toolkit to expose these hidden patterns, from classical correlation matrices to advanced techniques like mutual information, distance correlation, and dimensionality reduction. This article walks through each approach with working code you can apply to your own data.
When you look at a dataset with dozens or hundreds of columns, the relationships between those columns are not always obvious. A sales figure might rise and fall in sync with weather patterns, but through a non-linear curve that linear correlation completely ignores. A user behavior metric might depend on two features simultaneously in a way that pairwise analysis never reveals. The techniques covered here give you the tools to catch what traditional statistics miss.
Why Standard Correlation Falls Short
Pearson correlation is the default tool that pandas hands you when you call .corr(). It measures linear association between two variables on a scale from -1 to 1. When the relationship between two variables is a straight line, Pearson works perfectly. The problem is that real-world data rarely cooperates.
Consider a simple example: y = x**2. The variable y is entirely determined by x, yet the Pearson correlation between them is essentially zero when x is symmetrically distributed around the origin. The relationship is there, it is strong, and it is perfectly deterministic. But because it is not linear, Pearson misses it completely.
import numpy as np
x = np.random.uniform(-1, 1, size=1000)
y = x ** 2 + np.random.normal(0, 0.05, size=1000)
pearson_corr = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {pearson_corr:.4f}")
# Output: Pearson correlation: ~0.01 (near zero despite strong dependency)
This is the core problem. Any technique that relies solely on linear correlation will miss quadratic, sinusoidal, exponential, and other non-linear dependencies hiding in your data. The rest of this article covers the Python tools that catch what Pearson cannot.
The Datasaurus Dozen is a well-known illustration of this problem: twelve completely different scatter plots that all share the same Pearson correlation, mean, and standard deviation. The takeaway is clear — never rely on a single summary statistic to understand your data.
Pearson, Spearman, and Kendall: The Foundation
Before reaching for more advanced tools, it is worth understanding what the three classical correlation coefficients actually measure and where each one is appropriate.
Pearson measures linear association. It assumes continuous variables and is sensitive to outliers. Use it when you expect straight-line relationships.
Spearman ranks the data first, then applies Pearson to the ranks. This means it captures monotonic relationships (always increasing or always decreasing) even when the relationship is not linear. It is more robust to outliers than Pearson.
Kendall's Tau counts concordant and discordant pairs of observations. It also captures monotonic relationships, tends to be more robust with small sample sizes, and produces smaller correlation values than Spearman for the same data.
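The relationship between Spearman and Pearson is easy to verify directly: rank both variables, then compute Pearson on the ranks, and you reproduce SciPy's Spearman coefficient. A quick sketch (the exponential relationship here is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 300)
y = np.exp(2 * x) + rng.normal(0, 0.1, 300)  # monotonic but strongly non-linear

# Spearman is literally Pearson computed on the ranks
rho_manual = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
rho_scipy = stats.spearmanr(x, y)[0]

print(f"Pearson : {stats.pearsonr(x, y)[0]:.3f}")  # dragged down by curvature
print(f"Spearman: {rho_scipy:.3f}")                # near 1: the ordering is preserved
print(f"Manual  : {rho_manual:.3f}")               # matches spearmanr
```

Because the exponential curve preserves ordering perfectly, Spearman sits near 1 while Pearson is noticeably lower.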
import pandas as pd
import numpy as np
# Generate sample data with a monotonic but non-linear relationship
np.random.seed(42)
data = pd.DataFrame({
    'temperature': np.random.uniform(60, 100, 500)
})
data['ice_cream_sales'] = np.exp(0.04 * data['temperature']) + np.random.normal(0, 2, 500)
data['sunscreen_sales'] = data['temperature'] ** 1.5 / 100 + np.random.normal(0, 0.5, 500)
# Compare all three methods
print("Pearson correlation matrix:")
print(data.corr(method='pearson').round(3))
print("\nSpearman correlation matrix:")
print(data.corr(method='spearman').round(3))
print("\nKendall correlation matrix:")
print(data.corr(method='kendall').round(3))
In the example above, the exponential relationship between temperature and ice cream sales will produce a lower Pearson coefficient than Spearman, because the relationship is monotonic but not linear. Spearman captures the ordering better in this case.
Always compute both Pearson and Spearman on your dataset. If the Spearman value is significantly higher than Pearson for a variable pair, you likely have a non-linear monotonic relationship that deserves further investigation.
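That side-by-side comparison is easy to automate. Here is a minimal helper along these lines (the 0.1 gap threshold is an arbitrary starting point, not a standard value, and should be tuned per dataset):

```python
import numpy as np
import pandas as pd

def flag_nonlinear_monotonic(df, threshold=0.1):
    """Return column pairs where |Spearman| exceeds |Pearson| by `threshold`."""
    pearson = df.corr(method='pearson').abs()
    spearman = df.corr(method='spearman').abs()
    gap = spearman - pearson
    cols = df.columns
    flagged = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if gap.iloc[i, j] > threshold:
                flagged.append((cols[i], cols[j], round(gap.iloc[i, j], 3)))
    return sorted(flagged, key=lambda t: t[2], reverse=True)

# Example: an exponential (monotonic, non-linear) pair gets flagged
rng = np.random.default_rng(42)
demo = pd.DataFrame({'x': rng.uniform(0, 3, 500)})
demo['y'] = np.exp(2 * demo['x']) + rng.normal(0, 0.1, 500)
demo['z'] = rng.normal(0, 1, 500)  # unrelated noise column
print(flag_nonlinear_monotonic(demo))
```

The `x`/`y` pair is flagged because Spearman sees the perfect ordering that Pearson's linear fit misses.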
Mutual Information: Capturing Any Dependency
Mutual information (MI) comes from information theory and measures how much knowing one variable reduces the uncertainty about another. Unlike correlation, MI captures any type of statistical dependency, whether linear, quadratic, periodic, or otherwise. A mutual information value of zero means the variables are truly independent. Any value above zero indicates that they share some information.
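Before turning to estimators, the definition itself is worth seeing in code. For discrete variables, MI is a direct sum over the joint distribution, and a hand-rolled version matches scikit-learn's mutual_info_score (both in nats). The two categorical variables here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(0, 3, 10000)
y = (x + rng.integers(0, 2, 10000)) % 3  # y depends on x, with some randomness

# MI from the definition: sum over p(x, y) * log(p(x, y) / (p(x) * p(y)))
joint = np.zeros((3, 3))
np.add.at(joint, (x, y), 1)
joint /= joint.sum()
px = joint.sum(axis=1, keepdims=True)  # marginal of x
py = joint.sum(axis=0, keepdims=True)  # marginal of y
mask = joint > 0
mi_manual = np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))

print(f"Manual MI : {mi_manual:.4f}")
print(f"sklearn MI: {mutual_info_score(x, y):.4f}")
```

Knowing x narrows y down to two of three values, so the MI is positive; independent variables would give a value near zero.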
The scikit-learn library provides two functions for computing MI: mutual_info_regression for continuous target variables and mutual_info_classif for categorical targets. Both use a k-nearest neighbors estimator under the hood to handle continuous variables without requiring you to bin the data manually.
from sklearn.feature_selection import mutual_info_regression
import numpy as np
import pandas as pd
np.random.seed(42)
n = 2000
# Create variables with different relationship types
x1 = np.random.uniform(-3, 3, n)
x2 = np.random.uniform(-3, 3, n)
x3 = np.random.uniform(-3, 3, n)
noise = np.random.normal(0, 0.5, n)
y = (
2.0 * x1 # linear relationship
+ np.sin(3 * x2) * 3 # sinusoidal relationship
+ x3 ** 2 # quadratic relationship
+ noise
)
features = pd.DataFrame({'linear': x1, 'sinusoidal': x2, 'quadratic': x3})
# Compute mutual information scores
mi_scores = mutual_info_regression(features, y, random_state=42)
mi_results = pd.Series(mi_scores, index=features.columns).sort_values(ascending=False)
print("Mutual Information Scores:")
print(mi_results.round(3))
# Compare with Pearson correlation
print("\nPearson Correlation with target:")
for col in features.columns:
    print(f"  {col}: {np.corrcoef(features[col], y)[0, 1]:.3f}")
The output reveals the key strength of mutual information. Pearson correlation will report a small value for the sinusoidal relationship with x2 and an essentially zero value for the quadratic relationship with x3, because x3 ** 2 is symmetric around the origin and the linear association cancels out. Mutual information, by contrast, will correctly rank all three features as informative because it does not assume any particular functional form.
One limitation of MI is that it does not tell you the direction or shape of the relationship. It only tells you that a dependency exists and how strong it is. Once MI identifies an interesting pair, you will want to plot the variables against each other to understand what the relationship actually looks like.
Building a Full MI Matrix
Computing MI between all pairs of variables in a dataset requires iterating through each pair, since scikit-learn only computes MI between features and a target. Here is how to build a complete pairwise MI matrix:
from sklearn.feature_selection import mutual_info_regression
import numpy as np
import pandas as pd
def compute_mi_matrix(df, random_state=42):
    """Compute pairwise mutual information matrix for all columns."""
    cols = df.columns
    n = len(cols)
    mi_matrix = pd.DataFrame(np.zeros((n, n)), index=cols, columns=cols)
    for i in range(n):
        for j in range(i, n):
            if i == j:
                # MI of a variable with itself is its entropy, not a bounded
                # score, so mark the diagonal as undefined
                mi_matrix.iloc[i, j] = np.nan
            else:
                mi_val = mutual_info_regression(
                    df[[cols[i]]],
                    df[cols[j]],
                    random_state=random_state
                )[0]
                mi_matrix.iloc[i, j] = mi_val
                mi_matrix.iloc[j, i] = mi_val
    return mi_matrix
# Example usage
np.random.seed(42)
sample_data = pd.DataFrame({
'A': np.random.normal(0, 1, 1000),
'B': np.random.normal(0, 1, 1000),
})
sample_data['C'] = sample_data['A'] ** 2 + np.random.normal(0, 0.1, 1000)
sample_data['D'] = np.sin(sample_data['B'] * 2) + np.random.normal(0, 0.1, 1000)
mi_matrix = compute_mi_matrix(sample_data)
print("Pairwise Mutual Information Matrix:")
print(mi_matrix.round(3))
Distance Correlation: The Non-Linear Detector
Distance correlation is a dependency measure that equals zero if and only if two variables are statistically independent. This is a stronger guarantee than what Pearson or even Spearman provides. The dcor library makes this computation straightforward in Python.
The core idea behind distance correlation is to compute pairwise distances between all observations for each variable separately, then measure whether those distance patterns are related. If knowing the distances between observations in variable X tells you something about the distances in variable Y, the two variables are dependent.
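The idea translates almost line for line into NumPy. This is a from-scratch sketch of the naive O(n^2) estimator for one-dimensional inputs, shown only to make the definition concrete (use dcor itself in practice):

```python
import numpy as np

def distance_correlation_naive(x, y):
    """Naive O(n^2) distance correlation for 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise distance matrices for each variable separately
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center: remove row means and column means, add back the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()   # squared distance covariance
    dvarx = (A * A).mean()   # squared distance variance of x
    dvary = (B * B).mean()
    if dvarx == 0 or dvary == 0:
        return 0.0
    return np.sqrt(max(dcov2, 0) / np.sqrt(dvarx * dvary))

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 400)
print(f"y = 2x:  {distance_correlation_naive(x, 2 * x):.3f}")  # exactly 1
print(f"y = x^2: {distance_correlation_naive(x, x ** 2):.3f}")
print(f"indep:   {distance_correlation_naive(x, rng.normal(0, 1, 400)):.3f}")
```

The double-centered matrices A and B encode each variable's distance pattern; their elementwise product measures how much those patterns co-vary.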
import dcor
import numpy as np
np.random.seed(42)
n = 500
# Linear relationship
x_linear = np.random.uniform(-1, 1, n)
y_linear = 2 * x_linear + np.random.normal(0, 0.3, n)
# Quadratic relationship (Pearson misses this)
x_quad = np.random.uniform(-1, 1, n)
y_quad = x_quad ** 2 + np.random.normal(0, 0.05, n)
# Sinusoidal relationship (Pearson misses this)
x_sin = np.random.uniform(-2, 2, n)
y_sin = np.sin(x_sin * np.pi) + np.random.normal(0, 0.1, n)
# Independent variables (no relationship)
x_indep = np.random.uniform(-1, 1, n)
y_indep = np.random.normal(0, 1, n)
relationships = {
'Linear': (x_linear, y_linear),
'Quadratic': (x_quad, y_quad),
'Sinusoidal': (x_sin, y_sin),
'Independent': (x_indep, y_indep),
}
print(f"{'Relationship':<15} {'Pearson':>10} {'Distance Corr':>15}")
print("-" * 42)
for name, (x, y) in relationships.items():
    pearson = np.corrcoef(x, y)[0, 1]
    dcorr = dcor.distance_correlation(x, y)
    print(f"{name:<15} {pearson:>10.4f} {dcorr:>15.4f}")
The output will show that distance correlation correctly assigns high values to the quadratic and sinusoidal relationships where Pearson reports near-zero values. For independent variables, distance correlation will also be near zero, confirming the absence of any dependency.
Distance correlation ranges from 0 to 1 and does not indicate direction (positive or negative). If you need directional information, combine it with Spearman or a scatter plot. The dcor library also provides a distance_correlation_t_test function that returns a p-value, so you can determine whether the detected relationship is statistically significant.
The naive distance correlation algorithm has O(n^2) time complexity. For large datasets with tens of thousands of rows, consider sampling your data first or using the fast O(n log n) algorithm available in dcor for one-dimensional data by passing method="avl".
Phi_K: Mixed Variable Types Made Simple
Real datasets often contain a mix of numerical, categorical, and ordinal columns. Pearson correlation only handles numerical pairs. Spearman works with ordinal data. Neither handles a categorical column like "region" or "product_category" well. The Phi_K correlation coefficient was designed to work consistently across all variable types.
Phi_K is based on a refinement of Pearson's chi-squared test of independence. It interprets the test statistic from a contingency table as if it came from a rotated bivariate normal distribution, where the angle of rotation is interpreted as the correlation. The result is a coefficient that ranges from 0 to 1, captures non-linear dependencies, and handles categorical, ordinal, and continuous variables interchangeably.
import pandas as pd
import numpy as np
import phik
from phik import resources
# Load the built-in example dataset
df = pd.read_csv(resources.fixture('fake_insurance_data.csv.gz'))
print("Columns:", df.columns.tolist())
print("Shape:", df.shape)
# Compute the Phi_K correlation matrix across all variable types
phik_matrix = df.phik_matrix()
print("\nPhi_K Correlation Matrix:")
print(phik_matrix.round(3))
# Identify the most strongly correlated pairs
# Get global correlations (how strongly each variable relates to all others)
# Note: global_phik() returns a (values, labels) tuple
global_corr, global_labels = df.global_phik()
print("\nGlobal Phi_K correlations:")
print(pd.Series(global_corr.ravel(), index=global_labels).round(3))
# Significance matrix shows which relationships are statistically meaningful
sig_matrix = df.significance_matrix()
print("\nSignificance matrix (Z-scores):")
print(sig_matrix.round(2))
Phi_K extends pandas DataFrames directly through the phik import, adding methods like .phik_matrix(), .significance_matrix(), and .global_phik() that you can call on any DataFrame. The significance matrix is especially useful because it tells you which correlations are statistically meaningful versus those that could have arisen by chance.
import pandas as pd
import numpy as np
import phik
# Build your own mixed-type dataset
np.random.seed(42)
n = 1000
mixed_data = pd.DataFrame({
'age': np.random.randint(18, 70, n),
'income': np.random.lognormal(10.5, 0.8, n),
'region': np.random.choice(['North', 'South', 'East', 'West'], n),
'education': np.random.choice(
['High School', 'Bachelors', 'Masters', 'PhD'], n,
p=[0.3, 0.4, 0.2, 0.1]
),
'satisfaction_score': np.random.randint(1, 6, n),
})
# Make income depend on education
edu_bonus = {'High School': 0, 'Bachelors': 0.3, 'Masters': 0.6, 'PhD': 0.9}
mixed_data['income'] *= mixed_data['education'].map(edu_bonus) + 1
# Compute Phi_K and find the strongest relationships
phik_result = mixed_data.phik_matrix()
print("Phi_K Matrix for Mixed Data:")
print(phik_result.round(3))
Dimensionality Reduction: PCA, t-SNE, and UMAP
When your dataset has dozens or hundreds of variables, pairwise analysis becomes unwieldy. Dimensionality reduction techniques compress your data into a lower-dimensional space, and in doing so, they expose the structure and groupings that you would never find by examining variable pairs one at a time.
PCA: Finding Linear Structure
Principal Component Analysis finds the directions (components) in your data that explain the largest variance. Each component is a linear combination of your original variables, and the weights in that combination tell you which variables are driving the patterns in your data.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
np.random.seed(42)
n = 1000
# Create a dataset where some variables are secretly related
base_signal_1 = np.random.normal(0, 1, n)
base_signal_2 = np.random.normal(0, 1, n)
data = pd.DataFrame({
'feature_A': base_signal_1 + np.random.normal(0, 0.2, n),
'feature_B': base_signal_1 * 0.8 + np.random.normal(0, 0.3, n),
'feature_C': base_signal_1 * -0.5 + np.random.normal(0, 0.2, n),
'feature_D': base_signal_2 + np.random.normal(0, 0.1, n),
'feature_E': base_signal_2 * 1.2 + np.random.normal(0, 0.3, n),
'feature_F': np.random.normal(0, 1, n), # noise / unrelated
})
# Standardize before PCA
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Fit PCA
pca = PCA()
pca.fit(data_scaled)
# Examine explained variance
print("Explained variance ratio per component:")
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"  PC{i+1}: {var:.3f} ({var*100:.1f}%)")
# Examine which features load on each component
loadings = pd.DataFrame(
pca.components_.T,
columns=[f'PC{i+1}' for i in range(len(data.columns))],
index=data.columns
)
print("\nFeature loadings (first 3 components):")
print(loadings.iloc[:, :3].round(3))
In this example, PCA will reveal that the first two components explain the bulk of the variance. The loadings will show that features A, B, and C load heavily on the first component (because they all stem from base_signal_1) while features D and E load on the second component. Feature F, being pure noise, will not load strongly on any component. This is how PCA exposes the hidden groupings among your variables.
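A complementary way to surface the same groupings is to cluster the variables themselves, treating 1 - |correlation| as a distance. This sketch uses SciPy's hierarchical clustering on the same kind of latent-signal data as above (the 0.5 cut height is an arbitrary choice to tune per dataset):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_variables(df, cut=0.5):
    """Group columns by average-linkage clustering on 1 - |Spearman|."""
    dist = 1 - df.corr(method='spearman').abs()
    condensed = squareform(dist.values, checks=False)  # condensed distance vector
    labels = fcluster(linkage(condensed, method='average'), t=cut, criterion='distance')
    return pd.Series(labels, index=df.columns)

# A/B/C share one latent signal, D/E another, F is pure noise
rng = np.random.default_rng(42)
s1, s2 = rng.normal(0, 1, 1000), rng.normal(0, 1, 1000)
df = pd.DataFrame({
    'A': s1 + rng.normal(0, 0.2, 1000),
    'B': s1 * 0.8 + rng.normal(0, 0.3, 1000),
    'C': s1 * -0.5 + rng.normal(0, 0.2, 1000),
    'D': s2 + rng.normal(0, 0.1, 1000),
    'E': s2 * 1.2 + rng.normal(0, 0.3, 1000),
    'F': rng.normal(0, 1, 1000),
})
print(cluster_variables(df))
```

The cluster labels recover the latent structure without PCA: A, B, and C fall into one group, D and E into another, and F into its own singleton.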
t-SNE and UMAP: Revealing Non-Linear Clusters
PCA is limited to linear relationships. For datasets with complex, non-linear structure, t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are far more powerful visualization tools.
t-SNE excels at preserving local neighborhood structure, meaning points that are close together in high-dimensional space will remain close in the 2D projection. It is the go-to tool for visualizing clusters but can be slow on large datasets and does not preserve global distances well.
UMAP is a more recent algorithm that aims to preserve both local and global structure. It is significantly faster than t-SNE, scales better to large datasets, and produces embeddings that you can use for downstream machine learning tasks (not just visualization).
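One way to quantify the "local structure" claim is scikit-learn's trustworthiness score, which measures what fraction of each point's nearest neighbors in the original space survive into the embedding (1.0 means perfectly preserved). A quick comparison on the digits data used below:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X = load_digits().data  # 1797 samples, 64 features

emb_tsne = TSNE(n_components=2, perplexity=30, init='pca',
                random_state=42).fit_transform(X)
emb_pca = PCA(n_components=2).fit_transform(X)

# Fraction of 10-nearest-neighbor relationships preserved in 2D
print(f"t-SNE trustworthiness: {trustworthiness(X, emb_tsne, n_neighbors=10):.3f}")
print(f"PCA trustworthiness:   {trustworthiness(X, emb_pca, n_neighbors=10):.3f}")
```

On this dataset, the t-SNE embedding preserves substantially more of the local neighborhood structure than a plain two-component PCA projection.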
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import numpy as np
# Load a real high-dimensional dataset
digits = load_digits()
X = digits.data # 1797 samples, 64 features (8x8 pixel images)
y = digits.target # digit labels 0-9
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# Optional: reduce dimensions with PCA first (speeds up t-SNE)
pca_50 = PCA(n_components=50)
X_pca = pca_50.fit_transform(X_scaled)
# Apply t-SNE
tsne = TSNE(
n_components=2,
perplexity=30,
random_state=42,
max_iter=1000,  # renamed from n_iter in scikit-learn 1.5
learning_rate='auto',
init='pca'
)
X_tsne = tsne.fit_transform(X_pca)
print(f"t-SNE output shape: {X_tsne.shape}")
print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced to: {X_tsne.shape[1]} dimensions")
# Each cluster in the 2D output corresponds to a digit
# Points that are near each other represent similar images
for digit in range(10):
    mask = y == digit
    centroid = X_tsne[mask].mean(axis=0)
    print(f"  Digit {digit} centroid: ({centroid[0]:.1f}, {centroid[1]:.1f})")
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
# Apply UMAP
reducer = umap.UMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
random_state=42
)
X_umap = reducer.fit_transform(X_scaled)
print(f"UMAP output shape: {X_umap.shape}")
# UMAP often produces tighter, more separated clusters than t-SNE
for digit in range(10):
    mask = digits.target == digit
    spread = X_umap[mask].std(axis=0).mean()
    print(f"  Digit {digit} cluster spread: {spread:.3f}")
Install UMAP with pip install umap-learn (not pip install umap, which is a different package). When working with datasets larger than a few thousand rows, UMAP is generally the better choice over t-SNE due to its superior scaling properties. For datasets exceeding 100,000 rows, consider using PCA as a preprocessing step to reduce dimensions to around 50 before applying UMAP.
Putting It All Together: A Complete Workflow
Here is a practical workflow that combines multiple techniques to thoroughly explore the relationships in a dataset. The idea is to start broad and then focus in on the pairs and patterns that matter.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import dcor
def explore_relationships(df, target_col=None, sample_size=2000):
    """
    Comprehensive relationship discovery workflow.

    Parameters
    ----------
    df : pd.DataFrame
        Input dataset.
    target_col : str, optional
        Name of the target column for supervised analysis.
    sample_size : int
        Maximum rows for expensive computations.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

    # Subsample for expensive operations
    if len(df) > sample_size:
        df_sample = df.sample(sample_size, random_state=42)
    else:
        df_sample = df.copy()

    print("=" * 60)
    print("STEP 1: Classical Correlation Analysis")
    print("=" * 60)
    pearson = df_sample[numeric_cols].corr(method='pearson')
    spearman = df_sample[numeric_cols].corr(method='spearman')

    # Find pairs where Spearman >> Pearson (non-linear monotonic)
    diff = spearman.abs() - pearson.abs()
    print("\nVariable pairs with potential non-linear monotonic relationships")
    print("(Spearman significantly exceeds Pearson):\n")
    for i in range(len(numeric_cols)):
        for j in range(i + 1, len(numeric_cols)):
            d = diff.iloc[i, j]
            if d > 0.1:
                col_i, col_j = numeric_cols[i], numeric_cols[j]
                print(f"  {col_i} vs {col_j}:")
                print(f"    Pearson={pearson.iloc[i,j]:.3f}, "
                      f"Spearman={spearman.iloc[i,j]:.3f}, "
                      f"Diff={d:.3f}")

    print("\n" + "=" * 60)
    print("STEP 2: Mutual Information Analysis")
    print("=" * 60)
    if target_col and target_col in numeric_cols:
        feature_cols = [c for c in numeric_cols if c != target_col]
        mi_scores = mutual_info_regression(
            df_sample[feature_cols],
            df_sample[target_col],
            random_state=42
        )
        mi_ranked = pd.Series(
            mi_scores, index=feature_cols
        ).sort_values(ascending=False)
        print(f"\nMutual Information with '{target_col}':\n")
        for feat, score in mi_ranked.items():
            bar = "#" * int(score * 20)
            print(f"  {feat:<25s} {score:.3f} {bar}")

    print("\n" + "=" * 60)
    print("STEP 3: Distance Correlation (Non-Linear Detection)")
    print("=" * 60)
    # Compute distance correlation for all pairs, then report the strongest
    print("\nPairwise distance correlations:\n")
    dcor_results = []
    for i in range(len(numeric_cols)):
        for j in range(i + 1, len(numeric_cols)):
            col_i, col_j = numeric_cols[i], numeric_cols[j]
            dc = dcor.distance_correlation(
                df_sample[col_i].values,
                df_sample[col_j].values
            )
            pc = abs(pearson.iloc[i, j])
            dcor_results.append((col_i, col_j, dc, pc))
    dcor_results.sort(key=lambda x: x[2], reverse=True)
    for col_i, col_j, dc, pc in dcor_results[:10]:
        flag = "  << NON-LINEAR" if dc > pc + 0.15 else ""
        print(f"  {col_i} vs {col_j}: dCor={dc:.3f}, "
              f"|Pearson|={pc:.3f}{flag}")

    print("\n" + "=" * 60)
    print("STEP 4: PCA for Variable Grouping")
    print("=" * 60)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df_sample[numeric_cols])
    pca = PCA()
    pca.fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components_90 = np.searchsorted(cumulative, 0.90) + 1
    print(f"\nComponents needed for 90% variance: {n_components_90} "
          f"(out of {len(numeric_cols)} features)")
    loadings = pd.DataFrame(
        pca.components_[:3].T,
        columns=['PC1', 'PC2', 'PC3'],
        index=numeric_cols
    )
    print("\nTop loadings per component:")
    for pc in ['PC1', 'PC2', 'PC3']:
        top = loadings[pc].abs().nlargest(3)
        print(f"\n  {pc}:")
        for feat, val in top.items():
            sign = "+" if loadings.loc[feat, pc] > 0 else "-"
            print(f"    {sign}{feat}: {abs(loadings.loc[feat, pc]):.3f}")

    return {
        'pearson': pearson,
        'spearman': spearman,
        'dcor_results': dcor_results,
        'pca_loadings': loadings
    }
Call this function on any pandas DataFrame and it will walk through each stage, flagging non-linear relationships, ranking features by mutual information, identifying pairs with high distance correlation but low Pearson values, and showing you which variables cluster together through PCA loadings.
# Example: apply the workflow to a synthetic dataset
np.random.seed(42)
n = 1500
demo_data = pd.DataFrame({
'base_metric': np.random.normal(50, 10, n),
'noise': np.random.normal(0, 5, n),
})
demo_data['linear_dep'] = demo_data['base_metric'] * 1.5 + np.random.normal(0, 3, n)
demo_data['quadratic_dep'] = (demo_data['base_metric'] - 50) ** 2 / 20 + np.random.normal(0, 2, n)
demo_data['sin_dep'] = np.sin(demo_data['base_metric'] / 5) * 10 + np.random.normal(0, 1, n)
demo_data['target'] = (
demo_data['base_metric'] * 0.5
+ demo_data['quadratic_dep'] * 0.3
+ np.random.normal(0, 5, n)
)
results = explore_relationships(demo_data, target_col='target')
Key Takeaways
- Start with Pearson and Spearman together. Comparing them side by side immediately reveals which variable pairs have non-linear monotonic relationships. If Spearman is much higher than Pearson, investigate further.
- Use mutual information for feature ranking. When you have a target variable, MI from scikit-learn gives you a dependency-agnostic ranking that catches what correlation misses. It works with both continuous and discrete features.
- Deploy distance correlation for non-linear detection. The dcor library provides a coefficient that is zero if and only if variables are independent. Any pair with high distance correlation but low Pearson value has a hidden non-linear relationship worth investigating.
- Consider Phi_K for mixed data types. When your dataset combines numerical, categorical, and ordinal columns, Phi_K from the phik library gives you a single, consistent correlation metric across all types.
- Use PCA for variable grouping and UMAP for cluster visualization. PCA loadings reveal which variables move together. UMAP (or t-SNE for smaller datasets) projects high-dimensional data into 2D, making hidden clusters and groupings visible.
- Combine techniques in a layered workflow. No single tool catches everything. Start broad with classical correlation, then layer on MI and distance correlation for non-linear detection, and finish with dimensionality reduction for a structural overview.
The Python ecosystem for relationship discovery continues to expand. Libraries like dcor, phik, and umap-learn have matured significantly and work seamlessly alongside the established pandas and scikit-learn stack. By combining these tools into a systematic workflow, you can uncover the dependencies that standard analysis misses entirely, and make better decisions about which variables actually matter in your data.