Linear correlation measures how closely the relationship between two variables resembles a straight line. Python gives you three powerful libraries to calculate it: NumPy, SciPy, and pandas. This guide walks through each one with practical code you can use right away.
Whether you are exploring a dataset for the first time, building a machine learning pipeline, or preparing a research report, understanding how two variables relate to each other is a fundamental skill. The Pearson correlation coefficient is the standard tool for quantifying that linear relationship, and Python makes it straightforward to compute.
What Is Linear Correlation?
Linear correlation describes how closely two variables move together in a straight-line pattern. The Pearson correlation coefficient, commonly written as r, is a single number between -1 and +1 that captures this relationship. It is calculated as the ratio of the covariance of the two variables to the product of their individual standard deviations.
Here is what the value of r tells you:
- r = 1 -- A perfect positive linear relationship. As one variable increases, the other increases by a proportional amount.
- r = -1 -- A perfect negative linear relationship. As one variable increases, the other decreases proportionally.
- r = 0 -- No linear relationship between the variables.
- 0 < r < 1 -- A positive trend. Higher values indicate a stronger linear pattern.
- -1 < r < 0 -- A negative trend. Values closer to -1 indicate a stronger inverse pattern.
A Pearson coefficient of zero does not always mean the variables are independent. It only means there is no linear relationship. Two variables can have a strong non-linear association (such as a parabolic curve) while still producing an r-value near zero.
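The definition above is easy to verify by hand: dividing the covariance of the two arrays by the product of their standard deviations reproduces the library result. A minimal sketch with NumPy and some sample data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([52, 58, 65, 68, 72, 78, 82, 89, 93, 97])

# Pearson r from its definition: cov(x, y) / (std(x) * std(y))
cov = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov / (x.std() * y.std())

# Same value as the library routine
r_library = np.corrcoef(x, y)[0, 1]
print(r_manual, r_library)
```

Note that the population (ddof=0) covariance and standard deviations are used here; the normalization cancels in the ratio, so the sample versions give the same r.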
Calculating Correlation with NumPy
NumPy provides the np.corrcoef() function, which returns a full correlation coefficient matrix. This is the quickest way to get a Pearson r-value between two arrays.
import numpy as np
# Sample data: hours studied vs. exam scores
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
scores = np.array([52, 58, 65, 68, 72, 78, 82, 89, 93, 97])
# Calculate the correlation matrix
correlation_matrix = np.corrcoef(hours, scores)
print(correlation_matrix)
The output is a 2x2 matrix. The diagonal values are always 1.0 (each variable correlates perfectly with itself), and the off-diagonal values give you the Pearson r between the two arrays.
# Output:
# [[1.         0.99784892]
#  [0.99784892 1.        ]]
# Extract just the correlation coefficient
r = correlation_matrix[0, 1]
print(f"Pearson r: {r:.4f}")
# Pearson r: 0.9978
An r-value of 0.9978 indicates a very strong positive linear relationship -- as study hours increase, exam scores increase in a nearly perfect straight line.
By default, np.corrcoef() treats each row as a variable and each column as an observation. If your data is arranged differently, set rowvar=False to treat columns as variables instead.
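For example, if your observations are rows and your variables are columns (the usual tabular layout), stacking the two arrays column-wise and passing rowvar=False gives the same coefficient as passing them separately:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
scores = np.array([52, 58, 65, 68, 72, 78, 82, 89, 93, 97])

# Column-oriented layout: one column per variable, one row per observation
data = np.column_stack([hours, scores])   # shape (10, 2)

# rowvar=False tells corrcoef to treat columns as variables
corr = np.corrcoef(data, rowvar=False)
print(corr[0, 1])
```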
Correlating Multiple Variables at Once
You can pass a 2D array to np.corrcoef() to get pairwise correlations between all variables simultaneously.
import numpy as np
# Three variables: temperature, ice cream sales, sunscreen sales
temperature = np.array([68, 71, 75, 78, 82, 85, 89, 92, 95, 98])
ice_cream = np.array([120, 135, 155, 170, 195, 210, 240, 260, 285, 300])
sunscreen = np.array([50, 55, 65, 70, 80, 88, 100, 110, 120, 130])
# Stack them and compute the full correlation matrix
data = np.array([temperature, ice_cream, sunscreen])
corr_matrix = np.corrcoef(data)
print("Full correlation matrix:")
print(np.round(corr_matrix, 4))
This produces a 3x3 matrix where each cell shows the Pearson r between each pair of variables, making it easy to spot which relationships are the strongest.
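If you want to pull the strongest relationship out of a matrix like this programmatically, one option is to zero out the diagonal and take the largest remaining entry. A small sketch reusing the arrays above:

```python
import numpy as np

temperature = np.array([68, 71, 75, 78, 82, 85, 89, 92, 95, 98])
ice_cream = np.array([120, 135, 155, 170, 195, 210, 240, 260, 285, 300])
sunscreen = np.array([50, 55, 65, 70, 80, 88, 100, 110, 120, 130])

names = ['temperature', 'ice_cream', 'sunscreen']
corr = np.corrcoef([temperature, ice_cream, sunscreen])

# Ignore the diagonal (self-correlations are always 1.0)
masked = np.abs(corr) - np.eye(len(names))
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(f"Strongest pair: {names[i]} vs {names[j]} (r = {corr[i, j]:.4f})")
```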
Using SciPy for Pearson Correlation
While NumPy gives you the coefficient, scipy.stats.pearsonr() gives you the coefficient and a p-value in one call. The p-value tells you whether the observed correlation is statistically significant -- that is, how likely it is that an uncorrelated dataset would produce a coefficient at least this extreme.
import numpy as np
from scipy import stats
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
scores = np.array([52, 58, 65, 68, 72, 78, 82, 89, 93, 97])
# Calculate Pearson r and p-value
result = stats.pearsonr(hours, scores)
print(f"Pearson r: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.6f}")
The result is returned as a PearsonRResult object that unpacks like a (statistic, pvalue) tuple and exposes .statistic and .pvalue attributes. A very small p-value (typically less than 0.05) indicates that the correlation is statistically significant.
Confidence Intervals
SciPy also lets you compute a confidence interval for the correlation coefficient. This tells you the range within which the true population correlation likely falls.
# Get the 95% confidence interval
ci = result.confidence_interval(confidence_level=0.95)
print(f"95% CI: [{ci.low:.4f}, {ci.high:.4f}]")
By default, the confidence interval is computed using the Fisher transformation. You can also use bootstrap methods by passing an instance of stats.BootstrapMethod to the method parameter for more robust interval estimates.
Alternative Hypotheses
By default, pearsonr() performs a two-sided test (checking whether the correlation is significantly different from zero in either direction). You can specify a one-sided alternative if you have a directional hypothesis.
# Test if the correlation is significantly greater than zero
result_greater = stats.pearsonr(hours, scores, alternative='greater')
print(f"One-sided p-value: {result_greater.pvalue:.6f}")
# Test if the correlation is significantly less than zero
result_less = stats.pearsonr(hours, scores, alternative='less')
print(f"One-sided p-value: {result_less.pvalue:.6f}")
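Because the null distribution of the test statistic is symmetric, the one-sided p-value on the side matching the sign of r is half the two-sided value, and the two one-sided p-values sum to one. A quick check:

```python
import numpy as np
from scipy import stats

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
scores = np.array([52, 58, 65, 68, 72, 78, 82, 89, 93, 97])

p_two = stats.pearsonr(hours, scores).pvalue
p_greater = stats.pearsonr(hours, scores, alternative='greater').pvalue
p_less = stats.pearsonr(hours, scores, alternative='less').pvalue

# r is positive here, so the 'greater' p-value is half the two-sided one
print(p_two, 2 * p_greater)   # equal
print(p_greater + p_less)     # sums to 1
```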
Permutation and Monte Carlo Methods
The default analytical p-value is derived under the assumption that the data come from a normal distribution. When that assumption is questionable, SciPy can compute the p-value with a permutation test or Monte Carlo simulation instead.
rng = np.random.default_rng(42)
# Permutation test with 9,999 random resamples
method = stats.PermutationMethod(n_resamples=9999, random_state=rng)
result_perm = stats.pearsonr(hours, scores, method=method)
print(f"Permutation p-value: {result_perm.pvalue:.4f}")
# Monte Carlo test assuming uniform distributions
method_mc = stats.MonteCarloMethod(rvs=(rng.uniform, rng.uniform))
result_mc = stats.pearsonr(hours, scores, method=method_mc)
print(f"Monte Carlo p-value: {result_mc.pvalue:.4f}")
Recent SciPy releases include experimental support for the Python Array API standard, which lets supported functions accept arrays from libraries such as CuPy, PyTorch, and JAX in addition to NumPy. Set the environment variable SCIPY_ARRAY_API=1 to enable it, and check your version's release notes to confirm that pearsonr() is covered.
Correlation Matrices with pandas
When you are working with tabular data in a DataFrame, the .corr() method is the fastest path to a full correlation matrix.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'hours_studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'exam_score': [52, 58, 65, 68, 72, 78, 82, 89, 93, 97],
'sleep_hours': [5, 6, 7, 7, 8, 6, 8, 9, 7, 8]
})
# Compute the Pearson correlation matrix
print(df.corr())
The output is a DataFrame where each cell contains the Pearson r between the corresponding pair of columns. This is especially useful for quickly scanning a dataset with many numerical features.
# Output:
#                hours_studied  exam_score  sleep_hours
# hours_studied       1.000000    0.997849     0.689701
# exam_score          0.997849    1.000000     0.700893
# sleep_hours         0.689701    0.700893     1.000000
Spearman and Kendall Alternatives
The .corr() method also supports rank-based correlation methods. These are useful when the relationship between variables is monotonic but not necessarily linear.
# Spearman rank correlation
print(df.corr(method='spearman'))
# Kendall tau correlation
print(df.corr(method='kendall'))
Spearman correlation computes the Pearson coefficient on the ranked values of each variable rather than the raw values. This makes it less sensitive to outliers. Kendall tau measures the agreement between the rankings of two variables, counting concordant and discordant pairs.
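That equivalence is easy to verify: ranking each column with scipy.stats.rankdata and running Pearson on the ranks reproduces the Spearman value. A sketch:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'hours_studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'exam_score': [52, 58, 65, 68, 72, 78, 82, 89, 93, 97],
})

# Spearman via pandas
spearman = df['hours_studied'].corr(df['exam_score'], method='spearman')

# Pearson applied to the ranks of each column
pearson_on_ranks = stats.pearsonr(
    stats.rankdata(df['hours_studied']),
    stats.rankdata(df['exam_score']),
).statistic
print(spearman, pearson_on_ranks)   # identical
```

Both columns happen to be strictly increasing here, so both values come out as exactly 1.0.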
Correlating Specific Columns
If you only need the correlation between two specific columns, use the .corr() method on a Series.
# Correlation between two specific columns
r = df['hours_studied'].corr(df['exam_score'])
print(f"Pearson r: {r:.4f}")
Linear Regression and the r-value
SciPy's linregress() function performs a linear least-squares regression and returns the Pearson r as part of its output. This gives you the correlation coefficient alongside the slope, intercept, and standard error all in one call.
from scipy import stats
import numpy as np
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
scores = np.array([52, 58, 65, 68, 72, 78, 82, 89, 93, 97])
result = stats.linregress(hours, scores)
print(f"Slope: {result.slope:.4f}")
print(f"Intercept: {result.intercept:.4f}")
print(f"r-value: {result.rvalue:.4f}")
print(f"p-value: {result.pvalue:.6f}")
print(f"Std error: {result.stderr:.4f}")
The rvalue returned by linregress() is the same Pearson r you would get from pearsonr() or np.corrcoef(). The square of this value (rvalue ** 2) is the coefficient of determination, commonly written as R-squared. It tells you the proportion of variance in the dependent variable that is explained by the linear model.
# Coefficient of determination
r_squared = result.rvalue ** 2
print(f"R-squared: {r_squared:.4f}")
# R-squared: 0.9957 -- the model explains ~99.6% of the variance
The linregress() function also returns intercept_stderr (the standard error of the intercept). Access it as result.intercept_stderr -- this attribute is not available if you unpack the result as a 5-element tuple.
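A short demonstration of the difference between the two access styles:

```python
import numpy as np
from scipy import stats

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
scores = np.array([52, 58, 65, 68, 72, 78, 82, 89, 93, 97])
result = stats.linregress(hours, scores)

# Tuple-style unpacking yields exactly five values...
slope, intercept, rvalue, pvalue, stderr = result

# ...so the intercept's standard error is reachable only as an attribute
print(result.intercept_stderr)
```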
Handling NaN Values
Real-world datasets frequently contain missing values. Each library handles them differently, so it is important to know what to expect.
NumPy
The np.corrcoef() function propagates NaN values. If either array contains a NaN, the corresponding entries in the correlation matrix will be NaN.
import numpy as np
x = np.array([1, 2, 3, np.nan, 5])
y = np.array([2, 4, 6, 8, 10])
print(np.corrcoef(x, y))
# [[nan nan]
#  [nan  1.]]
# Every entry involving the array that contains the NaN is NaN
SciPy
The pearsonr() function does not clean missing data for you: depending on your SciPy version, a NaN in either input will propagate to a NaN result or trigger an error. Remove or impute missing values before calling it.
from scipy import stats
import numpy as np
x = np.array([1, 2, 3, np.nan, 5])
y = np.array([2, 4, 6, 8, 10])
# Remove pairs where either value is NaN
mask = ~np.isnan(x) & ~np.isnan(y)
result = stats.pearsonr(x[mask], y[mask])
print(f"Pearson r: {result.statistic:.4f}")
pandas
The .corr() method in pandas automatically excludes NaN values on a pairwise basis. This is the simplest approach when working with messy data.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'x': [1, 2, 3, np.nan, 5],
'y': [2, 4, 6, 8, 10]
})
# pandas handles NaN automatically
print(df.corr())
# Computes correlation using only the non-NaN pairs
When pandas uses pairwise NaN exclusion on a large correlation matrix, different pairs of columns may use different subsets of rows. This can occasionally produce a correlation matrix that is not positive semi-definite. If you need a consistent matrix (for example, as input to a machine learning model), consider dropping all rows with any NaN values before computing correlations.
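The sketch below contrasts the two strategies on synthetic data with missing values scattered at random (the 20% missing rate and column names are arbitrary). The listwise matrix is built from a single complete subset of rows, so it passes an eigenvalue check by construction:

```python
import numpy as np
import pandas as pd

# Synthetic data with scattered missing values
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 4))
raw[rng.random(raw.shape) < 0.2] = np.nan   # knock out ~20% of entries
df = pd.DataFrame(raw, columns=['a', 'b', 'c', 'd'])

# Pairwise exclusion: each cell may use a different subset of rows
corr_pairwise = df.corr()

# Listwise exclusion: every cell uses the same complete rows
corr_listwise = df.dropna().corr()

# A proper correlation matrix has no negative eigenvalues
print(np.linalg.eigvalsh(corr_listwise.to_numpy()).min())
```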
Common Pitfalls
Correlation analysis is straightforward in Python, but there are several traps that can lead to misleading conclusions.
Correlation is not causation. A high Pearson r between two variables does not mean one causes the other. Both may be driven by a third, unmeasured variable. For example, ice cream sales and drowning incidents are correlated, but both are driven by hot weather.
Pearson only captures linear relationships. If two variables have a strong quadratic or exponential relationship, the Pearson coefficient may be close to zero even though the variables are clearly associated. Spearman rank correlation handles monotonic non-linear relationships such as exponential growth, but a non-monotonic pattern like a parabola will defeat both coefficients -- only a plot will reveal it.
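A quick demonstration with a perfect parabola: the association is exact, yet Pearson r comes out at (numerically) zero because the relationship is not linear:

```python
import numpy as np
from scipy import stats

# A perfect quadratic relationship, symmetric around zero
x = np.linspace(-5, 5, 11)
y = x ** 2

# Despite the exact functional link, the linear correlation vanishes
r = stats.pearsonr(x, y).statistic
print(r)
```

The symmetry makes the covariance cancel exactly, so r is zero up to floating-point noise.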
Outliers distort the coefficient. A single extreme data point can dramatically inflate or deflate the Pearson r. Always visualize your data with a scatter plot before interpreting the coefficient.
import numpy as np
from scipy import stats
# Strong correlation without the outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(f"Without outlier: r = {stats.pearsonr(x, y).statistic:.4f}")
# r = 1.0000
# Add a single outlier
x_out = np.append(x, 11)
y_out = np.append(y, 100)
print(f"With outlier: r = {stats.pearsonr(x_out, y_out).statistic:.4f}")
# r drops to roughly 0.67
Small sample sizes are unreliable. With very few data points, you can get a high r-value purely by chance. Always check the p-value and consider whether your sample size is large enough to draw meaningful conclusions. The pearsonr() function requires at least two observations and raises an error for shorter inputs.
Constant arrays break the calculation. If either variable has zero variance (all values are the same), the correlation coefficient is undefined. Both np.corrcoef() and stats.pearsonr() will return NaN in this case.
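You can confirm this behavior directly; both libraries emit a warning here, which the sketch suppresses for clarity:

```python
import warnings

import numpy as np
from scipy import stats

x = np.array([3.0, 3.0, 3.0, 3.0, 3.0])   # zero variance
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # both routines warn on constant input
    r_np = np.corrcoef(x, y)[0, 1]
    r_sp = stats.pearsonr(x, y).statistic

print(r_np, r_sp)   # nan nan
```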
Key Takeaways
- NumPy for quick calculations: Use np.corrcoef() when you need the Pearson r between two arrays and do not need a p-value. It returns a full correlation matrix and handles multiple variables at once.
- SciPy for statistical rigor: Use scipy.stats.pearsonr() when you need both the correlation coefficient and its statistical significance. It also provides confidence intervals, alternative hypothesis testing, and permutation-based p-values.
- pandas for tabular data: Use DataFrame.corr() when working with DataFrames. It handles NaN values automatically and supports Pearson, Spearman, and Kendall methods through a single parameter.
- Always visualize first: Before interpreting any correlation coefficient, plot the data. A scatter plot reveals non-linear patterns, outliers, and clusters that a single number cannot capture.
- Check assumptions: The Pearson coefficient assumes a linear relationship and is sensitive to outliers. For non-linear or ordinal data, consider Spearman or Kendall rank correlation instead.
Linear correlation is one of the foundational tools in data analysis, and Python's ecosystem makes it accessible whether you are working with raw arrays or complex DataFrames. Start with a scatter plot, pick the right library for your needs, and always pair your coefficient with a significance test before drawing conclusions.