A Presbyterian minister dies in 1761. He leaves behind an unpublished manuscript that his friend edits for two years before reading it aloud at the Royal Society on December 23, 1763. The idea in that manuscript — one simple rule for updating beliefs in the face of evidence — spends the next 150 years being alternately championed, suppressed, used in secret to win wars, and declared intellectually taboo. Then computers arrive, and everything changes.
That minister was Thomas Bayes, and the idea is Bayes' theorem. If you write Python for data science, machine learning, or statistical analysis, you cannot avoid it. It is the engine behind PyMC, pgmpy, scikit-learn's BernoulliNB and GaussianNB, spam filters, medical diagnostic systems, and the probabilistic programming paradigm that has reshaped how scientists think about uncertainty. But understanding Bayesian methods — really understanding them, not just calling model.fit() — requires grasping why this 260-year-old theorem keeps winning arguments against seemingly simpler alternatives.
This article implements Bayesian thinking in Python at every step and explains why the debate between Bayesian and frequentist statistics matters for anyone writing code that makes decisions under uncertainty.
The Theorem Itself
Bayes' theorem can be written in one line:
P(H|D) = P(D|H) * P(H) / P(D)
In words: the probability of a hypothesis given the data equals the probability of the data given the hypothesis, multiplied by the prior probability of the hypothesis, divided by the total probability of the data.
That is the entire theorem. Every Bayesian method in Python — from a naive Bayes classifier to a hierarchical model with 10,000 parameters sampled via Markov Chain Monte Carlo — is an elaboration of this single equation.
Let us make it concrete with Python:
def bayes_theorem(prior, likelihood, marginal_likelihood):
    """
    Compute the posterior probability using Bayes' theorem.

    Parameters:
        prior: P(H) - our belief before seeing data
        likelihood: P(D|H) - probability of data given hypothesis
        marginal_likelihood: P(D) - total probability of data

    Returns:
        posterior: P(H|D) - updated belief after seeing data
    """
    return (likelihood * prior) / marginal_likelihood
# A disease affects 1 in 1000 people
prior_disease = 0.001
prior_healthy = 0.999
# A test is 99% sensitive (true positive rate)
# and 95% specific (true negative rate)
sensitivity = 0.99 # P(positive | disease)
false_positive = 0.05 # P(positive | healthy)
# P(positive) = P(positive|disease)*P(disease) + P(positive|healthy)*P(healthy)
p_positive = sensitivity * prior_disease + false_positive * prior_healthy
# If you test positive, what is the probability you actually have the disease?
posterior = bayes_theorem(prior_disease, sensitivity, p_positive)
print(f"P(disease | positive test) = {posterior:.4f}")
# Output: P(disease | positive test) = 0.0194
The result shocks people encountering it for the first time. A 99%-sensitive test returns positive, yet there is only about a 2% chance you actually have the disease. The low base rate (1 in 1000) means the false positives from the 999 healthy people vastly outnumber the true positives from the 1 sick person. This is not a flaw in Bayesian reasoning; it is Bayesian reasoning catching a flaw in naive intuition, one with real clinical consequences when positive screens trigger cascades of unnecessary follow-up procedures.
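If the algebra feels slippery, the same computation is easier to audit as whole-number counts. A quick sketch, screening a hypothetical population of 100,000 people with the numbers above:

```python
# Sanity-check the 1.94% result with counts instead of probabilities.
population = 100_000

diseased = population * 0.001          # 100 people have the disease
healthy = population - diseased        # 99,900 do not

true_positives = diseased * 0.99       # 99 sick people test positive
false_positives = healthy * 0.05       # 4,995 healthy people test positive

# Of everyone who tested positive, what fraction is actually sick?
p_disease_given_positive = true_positives / (true_positives + false_positives)
print(f"{p_disease_given_positive:.4f}")  # 0.0194, matching Bayes' theorem
```

The 4,995 false positives swamping 99 true positives is the whole story in two numbers.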
The Historical Drama
Understanding why Bayesian methods matter requires understanding why they were suppressed. The story, meticulously documented in Sharon Bertsch McGrayne's 2011 book The Theory That Would Not Die (Yale University Press), reads less like a statistics textbook and more like intellectual espionage.
Thomas Bayes himself was not confident enough in his result to publish it. After his death in 1761, his family passed his papers to Richard Price, a fellow nonconformist minister and mathematician. As the historian Stephen Stigler argued in his 2018 paper "Richard Price, the First Bayesian" (Statistical Science, Vol. 33, No. 1), Price's contribution was far more than editorial: he wrote roughly half of the published paper, provided the first practical application of the theory, and framed it within a theological argument about the existence of stable laws in nature. Price communicated the essay to John Canton at the Royal Society in a letter dated November 10, 1763, where he described Bayes's manuscript as having "great merit" and being well deserving of preservation.
Pierre-Simon Laplace independently derived the same result in the 1780s and extended it dramatically, building what he considered a general framework for reasoning under uncertainty. Laplace viewed probability as, in his own words, "nothing but good common sense reduced to mathematics." For roughly a century, this Bayesian approach — reasoning from data to the probability of causes — was simply how statistics worked.
Then Ronald Fisher happened.
In the early twentieth century, Fisher, Jerzy Neyman, and Egon Pearson developed what we now call frequentist statistics: p-values, confidence intervals, hypothesis testing. Their framework was computationally tractable on the desk calculators of the era (McGrayne notes that many of Fisher's innovations were, at their core, solutions to computational limitations). It was also philosophically "objective" in a way that appealed to scientists: no priors, no subjective beliefs, just long-run frequencies. For much of the twentieth century, Bayesian methods were considered professionally taboo in academic statistics.
What changed was computation. The combination of Markov Chain Monte Carlo algorithms (rooted in the Monte Carlo methods developed at Los Alamos in the 1940s for nuclear weapons simulations, then adapted for general statistical inference by Gelfand and Smith in 1990) and modern computing power made it feasible to compute posterior distributions for complex models that would have been analytically intractable. McGrayne's book describes this combination of Bayesian methods and MCMC as having been called "arguably the most powerful mechanism ever created for processing data and knowledge."
Bayesian vs. Frequentist: What the Debate Actually Means
The disagreement is not about which formula to use. It is about what probability means.
A frequentist interprets probability as a long-run frequency. The probability of heads is 0.5 because, if you flip a coin many times, roughly half will be heads. Under this view, parameters are fixed but unknown constants. You cannot assign a probability to a parameter — it either is or is not 2.7. You can only describe the behavior of your estimator across hypothetical repeated samples. A 95% confidence interval does not mean there is a 95% probability the parameter is inside the interval. It means that if you repeated the experiment many times, 95% of the intervals you construct would contain the true parameter.
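This repeated-sampling interpretation can be checked directly by simulation. In the sketch below, the true mean of 2.7 and the known standard deviation of 1.0 are arbitrary choices for illustration; the point is how often the constructed intervals cover the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n = 2.7, 1.0, 50
n_experiments = 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mu, sigma, size=n)
    se = sigma / np.sqrt(n)              # known sigma, so the interval is exact
    lower = sample.mean() - 1.96 * se
    upper = sample.mean() + 1.96 * se
    covered += (lower <= true_mu <= upper)

print(f"Coverage: {covered / n_experiments:.3f}")  # close to 0.950
```

The 95% is a property of the interval-building procedure across experiments, not of any single interval.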
A Bayesian interprets probability as a degree of belief. You can absolutely assign a probability to a parameter — it expresses your uncertainty about its value. A 95% credible interval means exactly what people want it to mean: there is a 95% probability that the parameter lies within this range, given the data and your prior assumptions.
As the statisticians Wagenmakers, Lee, Lodewyckx, and Iverson argued in their influential chapter "Bayesian Versus Frequentist Inference" (2008), the criticism of frequentist methods centers on their tendency to provide answers to questions that practitioners did not ask. A researcher typically wants to know: "Given my data, how likely is this hypothesis?" Frequentist methods answer: "Given this hypothesis, how likely is this data?" Those are different questions, and confusing them is the source of the misinterpretation of p-values that pervades published science.
Andrew Gelman, one of the field's leading practitioners and author of Bayesian Data Analysis (now in its third edition), has argued a more nuanced position. In a 2018 blog post titled "Bayesians are frequentists," he wrote that "the Bayesian prior distribution corresponds to the frequentist sample space: it's the set of problems for which a particular statistical model or procedure will be applied." His view is that the dichotomy is less stark than partisans on either side suggest, and that good statistical practice involves building Bayesian models but checking them with frequentist tools like posterior predictive checks.
This pragmatism is reflected in modern Python tooling. PyMC, the leading Bayesian modeling library, integrates tightly with ArviZ, which provides frequentist-style diagnostics (trace plots, R-hat convergence statistics, effective sample size) for validating Bayesian model fits.
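To demystify one of those diagnostics: R-hat compares between-chain and within-chain variance, and sits near 1 when chains agree. Below is a simplified, non-split version of the statistic (ArviZ's implementation uses rank-normalized split chains and differs in detail), with simulated draws standing in for sampler output:

```python
import numpy as np

def r_hat(chains):
    """Gelman-Rubin potential scale reduction factor (simplified, non-split).

    chains: array of shape (n_chains, n_draws).
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)          # between-chain variance
    within = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_est = (n - 1) / n * within + between / n   # pooled variance estimate
    return np.sqrt(var_est / within)

rng = np.random.default_rng(1)
good_chains = rng.normal(0, 1, size=(4, 1000))     # four well-mixed chains
bad_chains = good_chains + np.array([[0.0], [0.0], [0.0], [5.0]])  # one stuck chain

print(f"Well-mixed R-hat:  {r_hat(good_chains):.3f}")   # close to 1.0
print(f"Stuck-chain R-hat: {r_hat(bad_chains):.3f}")    # far above 1.0
```

A rule of thumb in practice: treat R-hat above roughly 1.01 as a sign the chains have not converged.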
Let us see the practical difference in Python:
import numpy as np
from scipy import stats
# Observe 7 heads in 10 coin flips
n_flips = 10
n_heads = 7
# --- Frequentist approach ---
# MLE: just the proportion
p_hat = n_heads / n_flips
# 95% confidence interval (Wald)
se = np.sqrt(p_hat * (1 - p_hat) / n_flips)
ci_lower = p_hat - 1.96 * se
ci_upper = p_hat + 1.96 * se
print(f"Frequentist MLE: {p_hat:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
# --- Bayesian approach ---
# Prior: Beta(1, 1) = uniform (we have no prior opinion)
alpha_prior, beta_prior = 1, 1
# Posterior: Beta(alpha_prior + heads, beta_prior + tails)
alpha_post = alpha_prior + n_heads
beta_post = beta_prior + (n_flips - n_heads)
posterior = stats.beta(alpha_post, beta_post)
print(f"\nBayesian posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: [{posterior.ppf(0.025):.3f}, {posterior.ppf(0.975):.3f}]")
print(f"P(coin is fair, p > 0.4 and p < 0.6): {posterior.cdf(0.6) - posterior.cdf(0.4):.3f}")
That last line is something a frequentist framework simply cannot give you: the direct probability that the parameter lies in a specific range. The Bayesian can say "there is roughly a 27% probability the coin is close to fair." The frequentist cannot make that statement at all; they can only say whether they reject or fail to reject a null hypothesis at some significance level.
The Three Ingredients: Priors, Likelihoods, and Posteriors
Every Bayesian analysis has three components, and understanding each is essential to writing correct code.
The Prior
The prior distribution, P(H), encodes what you believe before seeing data. This is the part that makes frequentists uncomfortable. In practice, priors come from three sources. Informative priors encode genuine domain knowledge: if previous clinical trials found a drug effect of approximately 2.5 with a standard deviation of 0.8, you might use Normal(2.5, 0.8) as your prior. Weakly informative priors constrain the parameter to a plausible range without strongly influencing the posterior: a Normal(0, 10) on a regression coefficient says "the effect is probably not larger than 30, but I'm not sure of the sign." Uninformative priors attempt to let the data speak entirely: a uniform distribution or Jeffreys prior.
from scipy import stats
import numpy as np
# Three different priors for a proportion
x = np.linspace(0, 1, 1000)
# Uninformative: Beta(1, 1) = Uniform
uninformative = stats.beta(1, 1)
# Weakly informative: Beta(2, 2) - slight preference for middle values
weakly_informative = stats.beta(2, 2)
# Informative: Beta(20, 5) - strong belief the proportion is around 0.8
informative = stats.beta(20, 5)
# After observing 3 successes in 10 trials, the posteriors differ:
successes, trials = 3, 10
failures = trials - successes
for name, prior_a, prior_b in [("Uninformative", 1, 1),
                               ("Weakly informative", 2, 2),
                               ("Informative", 20, 5)]:
    post = stats.beta(prior_a + successes, prior_b + failures)
    print(f"{name:>20s} prior -> Posterior mean: {post.mean():.3f}, "
          f"95% CI: [{post.ppf(0.025):.3f}, {post.ppf(0.975):.3f}]")
Run this and you will see the informative prior fighting the data. With only 3 successes in 10 trials, the data suggests a proportion around 0.3. But the informative prior strongly believes the proportion is near 0.8. With 10 observations, the informative prior wins. With 1,000 observations, the data would overwhelm any prior. This is the mechanism by which prior knowledge gets correctly weighted against evidence.
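To see the crossover, hold the observed success rate at 30% and scale up the sample size. The sketch below (the sample sizes are arbitrary) shows the informative Beta(20, 5) prior losing its grip as data accumulates:

```python
from scipy import stats

# An informative prior that strongly believes the proportion is near 0.8
prior_a, prior_b = 20, 5

# The same 30% success rate observed at increasing sample sizes
for trials in (10, 100, 1000, 10000):
    successes = int(trials * 0.3)
    post = stats.beta(prior_a + successes, prior_b + (trials - successes))
    print(f"n={trials:>6d}: posterior mean = {post.mean():.3f}")
# Posterior means: 0.657, 0.400, 0.312, 0.301 -- converging on the data
```

By n = 10,000 the prior contributes almost nothing; the posterior mean is essentially the empirical 0.30.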
The Likelihood
The likelihood function, P(D|H), describes the probability of observing your data given specific parameter values. This is identical in Bayesian and frequentist statistics — both use the same likelihood. The difference is what they do with it.
A frequentist maximizes the likelihood to find the single best parameter estimate (Maximum Likelihood Estimation). A Bayesian multiplies the likelihood by the prior and normalizes, yielding a full distribution over parameter values.
# The likelihood function for binomial data
def binomial_likelihood(p, n_successes, n_trials):
    """P(data | p) for binomial model."""
    return stats.binom.pmf(n_successes, n_trials, p)
# Evaluate the likelihood across possible values of p
p_values = np.linspace(0.001, 0.999, 1000)
likelihoods = binomial_likelihood(p_values, n_successes=7, n_trials=10)
# The MLE is the value that maximizes this function
mle = p_values[np.argmax(likelihoods)]
print(f"Maximum Likelihood Estimate: {mle:.3f}") # approx 0.700
# The Bayesian doesn't pick one value -- they use the entire curve
# weighted by the prior
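That "entire curve" can be computed directly by grid approximation: evaluate likelihood times prior at many candidate values of p, then normalize. A minimal sketch, which should agree with the exact conjugate Beta(8, 4) result for this problem:

```python
import numpy as np
from scipy import stats

# Grid approximation: posterior ∝ likelihood × prior, then normalize
p_grid = np.linspace(0.001, 0.999, 1000)
dp = p_grid[1] - p_grid[0]
prior_pdf = stats.beta(1, 1).pdf(p_grid)            # uniform prior
likelihood = stats.binom.pmf(7, 10, p_grid)         # 7 heads in 10 flips
unnormalized = likelihood * prior_pdf
posterior_pdf = unnormalized / (unnormalized.sum() * dp)  # Riemann-sum normalization

# Compare against the exact conjugate posterior Beta(1 + 7, 1 + 3) = Beta(8, 4)
grid_mean = (p_grid * posterior_pdf * dp).sum()
print(f"Grid posterior mean:  {grid_mean:.4f}")
print(f"Exact Beta(8,4) mean: {stats.beta(8, 4).mean():.4f}")
```

Grid approximation scales terribly beyond a few parameters, which is exactly why MCMC exists, but in one dimension it makes the prior-times-likelihood mechanics completely transparent.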
The Posterior
The posterior distribution, P(H|D), is the output: your updated belief after seeing the data. In simple cases (like the Beta-Binomial above), the posterior has a closed-form solution because the Beta distribution is the conjugate prior for the Binomial likelihood. Conjugate priors were critical in the pre-computer era because they made the math tractable. As McGrayne notes in her history, the advent of cheap computing made "computational simplifications such as using conjugate priors completely unnecessary." Today, tools like PyMC compute posteriors for arbitrary models using MCMC sampling.
import pymc as pm
import arviz as az
# The same coin-flip problem, but using PyMC
with pm.Model() as coin_model:
    # Prior: uniform on [0, 1]
    p = pm.Beta("p", alpha=1, beta=1)
    # Likelihood: 7 heads in 10 flips
    obs = pm.Binomial("obs", n=10, p=p, observed=7)
    # Sample from the posterior
    trace = pm.sample(2000, tune=1000, cores=2, random_seed=42)
summary = az.summary(trace, var_names=["p"])
print(summary)
This is overkill for a coin flip — the Beta-Binomial conjugate solution is exact and instantaneous. But the point is that the PyMC code looks identical whether you are modeling a coin or a hierarchical spatiotemporal regression with latent variables. The model specification is always: declare priors, specify the likelihood, sample. The NUTS sampler (No-U-Turn Sampler, Hoffman and Gelman, 2014) handles the rest.
Bayesian Thinking Beyond Statistics
Bayesian reasoning is not just a statistical technique. It is a framework for thinking about evidence, and it appears throughout computer science and machine learning under different names.
Naive Bayes classifiers apply Bayes' theorem with a conditional independence assumption to classify text, emails, and other high-dimensional data. Scikit-learn implements several variants:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate synthetic data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# GaussianNB assumes features follow a Gaussian distribution
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"GaussianNB accuracy: {accuracy_score(y_test, y_pred):.3f}")
# Access the learned parameters -- these ARE the Bayesian posterior
print(f"Class priors: {gnb.class_prior_}") # P(class)
print(f"Class means shape: {gnb.theta_.shape}") # P(feature | class) params
The "naive" in Naive Bayes refers to the assumption that features are conditionally independent given the class label. This assumption is almost always wrong, yet the classifier works surprisingly well in practice. The reason is that Bayes' theorem is remarkably robust: even when the probability estimates are poorly calibrated, the rankings they produce tend to be correct.
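The mechanics are simple enough to sketch from scratch. The toy classifier below is a deliberately minimal stand-in for scikit-learn's GaussianNB (class and attribute names here are made up): it fits one normal distribution per class-feature pair and classifies by the largest log-posterior:

```python
import numpy as np

class TinyGaussianNB:
    """Minimal Gaussian naive Bayes: one Normal per (class, feature) pair."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # log P(class | x) ∝ log P(class) + Σ_f log Normal(x_f; mean, var)
        log_post = np.log(self.priors_) + np.stack([
            -0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
            for m, v in zip(self.means_, self.vars_)
        ], axis=1)
        return self.classes_[np.argmax(log_post, axis=1)]

rng = np.random.default_rng(7)
X0 = rng.normal(-2, 1, size=(100, 2))   # class 0 cluster
X1 = rng.normal(+2, 1, size=(100, 2))   # class 1 cluster
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

model = TinyGaussianNB().fit(X, y)
print(f"Training accuracy: {(model.predict(X) == y).mean():.3f}")  # near 1.0
```

The sum inside predict is the independence assumption in code: the joint likelihood is just a product (a sum in log space) of per-feature likelihoods.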
Bayesian optimization uses Gaussian Process priors to efficiently search hyperparameter spaces. Libraries like Optuna and scikit-optimize implement this, fitting a Bayesian surrogate model to the objective function and choosing the next evaluation point by balancing exploration (trying unknown regions) and exploitation (refining known good regions). Every call to study.optimize() in Optuna is performing Bayesian reasoning about where the optimum might lie.
Bayesian neural networks place prior distributions on network weights, producing uncertainty estimates for predictions. While not yet mainstream, libraries like PyMC and TensorFlow Probability make them feasible, and the uncertainty quantification they provide is critical for safety-sensitive applications.
Related PEPs and Python Infrastructure
Several Python Enhancement Proposals have shaped the ecosystem that makes Bayesian computing practical.
PEP 450 (Final, Python 3.4) added the statistics module to the standard library. While its statistics.mean() and statistics.stdev() are frequentist computations, the PEP established the principle that basic statistical operations belong in Python's standard library. The module's NormalDist class (added in Python 3.8) provides a lightweight normal distribution object that can compute PDFs, CDFs, and quantiles: the building blocks of Bayesian conjugate analysis.
PEP 465 (Final, Python 3.5) introduced the @ operator for matrix multiplication. Before this, matrix operations required numpy.dot() or numpy.matmul() calls that obscured the mathematical intent. The @ operator is ubiquitous in Bayesian linear algebra — computing likelihood functions, covariance matrices, and posterior means all involve matrix multiplication, and X @ beta reads like the textbook formula.
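As an illustration, conjugate Bayesian linear regression with known noise variance reduces to two matrix expressions, and the @ operator keeps them legible. The prior scale tau2 and noise variance sigma2 below are assumed values chosen for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_beta = np.array([1.5, -2.0, 0.5])
sigma2 = 0.25                                   # known noise variance (assumed)
y = X @ true_beta + rng.normal(0, np.sqrt(sigma2), size=n)

tau2 = 10.0                                     # prior: beta ~ Normal(0, tau2 * I)
A = X.T @ X / sigma2 + np.eye(d) / tau2         # posterior precision matrix
posterior_mean = np.linalg.solve(A, X.T @ y / sigma2)
posterior_cov = np.linalg.inv(A)                # full uncertainty, not a point estimate

print(posterior_mean.round(2))                  # close to true_beta
```

Each line reads like the textbook formula, which is precisely the readability argument PEP 465 made.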
PEP 484 (Final, Python 3.5) introduced type hints, which now annotate much of the scientific Python stack. PyMC's PyTensor backend goes further with its own static type system: each graph variable carries a type recording its dtype and, where known, its shape, so shape mismatches between prior distributions and data, a common source of bugs in probabilistic models, can often be caught at graph-construction time before any sampling begins.
PEP 3118 (Final, Python 3.0) defined the buffer protocol that enables zero-copy sharing between NumPy arrays, C extensions, and compiled code. PyMC compiles model graphs to C (or JAX, or Numba) via PyTensor, and the buffer protocol is the mechanism by which sampled values flow between the compiled sampler and Python-level analysis code without costly memory copies.
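The protocol itself is easy to demonstrate from the standard library: NumPy can wrap raw bytes without copying, so writes through the array are visible in the original buffer.

```python
import numpy as np

buf = bytearray(8 * 4)                       # raw bytes owned by Python
arr = np.frombuffer(buf, dtype=np.float64)   # zero-copy view via the buffer protocol

arr[:] = 1.5                                 # writes land directly in the bytearray
print(buf[:8])                               # first double, 1.5 encoded little-endian
```

No bytes are duplicated at any point; the array and the bytearray are two views of the same memory, which is the property large samplers rely on.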
PEP 517/518 (Final, Python 3.7+) modernized the build system with pyproject.toml. PyMC's migration path from Theano (pre-2020) to Aesara (2020–2022) to PyTensor (2022–present) required clean dependency management through three backend changes while maintaining API stability for users. The pyproject.toml standard made this migration significantly more manageable.
PEP 734 (Python 3.14, October 2025) introduced sub-interpreters for concurrent execution. MCMC sampling is embarrassingly parallel — multiple chains run independently to diagnose convergence — and sub-interpreters offer a path to true parallelism without the overhead of multiprocessing that PyMC currently uses when you set cores=4 in pm.sample().
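The "embarrassingly parallel" claim is easy to illustrate: independent chains are just a map over seeds. The toy Metropolis sampler below is a sketch only (this is not how PyMC parallelizes internally); it targets a standard normal and runs four chains across a thread pool:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def metropolis_chain(seed, n_draws=5000):
    """Toy random-walk Metropolis sampler targeting a standard normal."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    x = 0.0
    for i in range(n_draws):
        proposal = x + rng.normal(0, 1)
        # Accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.uniform()) < 0.5 * (x**2 - proposal**2):
            x = proposal
        draws[i] = x
    return draws

# Chains share nothing, so they can simply be mapped across workers
with ThreadPoolExecutor(max_workers=4) as pool:
    chains = list(pool.map(metropolis_chain, range(4)))

print([f"{c.mean():+.2f}" for c in chains])   # each chain mean near 0
```

Swapping the executor is the whole parallelization story, which is why sub-interpreters are an attractive drop-in for this workload.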
Building Bayesian Intuition: A Complete Example
Let us work through a realistic problem end-to-end. Suppose you run an online store and want to estimate the conversion rate for a new checkout design. You have observed 47 purchases from 312 visitors.
import numpy as np
from scipy import stats
visitors = 312
purchases = 47
# ---- Step 1: Define the prior ----
# Your old checkout converted at about 12%. You're mildly optimistic
# about the new design but not certain.
# Beta(12, 88) has mean 0.12 and concentrates probability near 12%
alpha_prior = 12
beta_prior = 88
prior = stats.beta(alpha_prior, beta_prior)
print(f"Prior mean: {prior.mean():.3f}")
print(f"Prior 95% CI: [{prior.ppf(0.025):.3f}, {prior.ppf(0.975):.3f}]")
# ---- Step 2: Update with data ----
# Conjugate update: Beta(a + successes, b + failures)
alpha_post = alpha_prior + purchases
beta_post = beta_prior + (visitors - purchases)
posterior = stats.beta(alpha_post, beta_post)
print(f"\nPosterior mean: {posterior.mean():.3f}")
print(f"Posterior 95% CI: [{posterior.ppf(0.025):.3f}, {posterior.ppf(0.975):.3f}]")
# ---- Step 3: Answer business questions directly ----
# Probability the new design is better than the old 12% rate
p_better = 1 - posterior.cdf(0.12)
print(f"\nP(new design > 12% conversion): {p_better:.3f}")
# Probability the new design exceeds 15% conversion
p_great = 1 - posterior.cdf(0.15)
print(f"P(new design > 15% conversion): {p_great:.3f}")
# Expected revenue per 1000 visitors (assuming $50 per purchase)
# Using the full posterior distribution
samples = posterior.rvs(10000)
revenue_per_1000 = samples * 1000 * 50
print(f"\nExpected revenue per 1000 visitors: ${np.mean(revenue_per_1000):.0f}")
print(f"95% range: [${np.percentile(revenue_per_1000, 2.5):.0f}, "
      f"${np.percentile(revenue_per_1000, 97.5):.0f}]")
This is the payoff of Bayesian thinking for practitioners. You are not testing a null hypothesis. You are not computing a p-value. You are answering the questions your stakeholder actually asked: "Is the new design better? How confident are we? What does that mean for revenue?" And you are propagating uncertainty through the entire calculation, so your revenue estimate comes with a range, not a false-precision point estimate.
When Bayesian Methods Are Worth the Complexity
Bayesian methods are not always the right choice. For a quick A/B test with millions of data points, a chi-squared test gives you an answer in microseconds and the prior would be overwhelmed by data anyway. For training a deep neural network, the Bayesian posterior over millions of weights is currently intractable for production use.
Bayesian methods earn their keep when:
- Data is scarce. With 10 observations, the prior matters. Bayesian estimation handles small samples gracefully through regularization that emerges naturally from the prior, rather than requiring ad hoc penalties.
- Uncertainty quantification is critical. Medical decisions, financial risk models, autonomous systems — anywhere that acting on a point estimate without knowing its reliability could cause harm.
- You need to incorporate prior knowledge. If decades of clinical research tell you a drug effect is approximately 2.0, ignoring that information and starting from scratch with each new trial wastes evidence. Informative priors are the formal mechanism for cumulative learning.
- The model is hierarchical. Estimating parameters for many related groups (schools, hospitals, customers) benefits enormously from partial pooling, where data-poor groups borrow strength from data-rich groups. This is Bayesian inference's signature capability, and it has no clean frequentist equivalent.
- You need to make sequential decisions. Bayesian updating is naturally sequential: the posterior from today's analysis becomes tomorrow's prior. This makes it ideal for adaptive experiments, online learning, and any setting where data arrives incrementally.
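The sequential claim can be verified exactly with the conjugate Beta-Binomial update; the visit counts below are made up for illustration:

```python
from scipy import stats

# Day 1: start from a uniform prior, observe 12 conversions in 100 visits
a, b = 1, 1
a, b = a + 12, b + 88

# Day 2: yesterday's posterior is today's prior; observe 9 conversions in 80 visits
a, b = a + 9, b + 71
sequential = stats.beta(a, b)

# A one-shot update on the pooled data (21 of 180) gives the identical posterior
pooled = stats.beta(1 + 21, 1 + 159)

print(f"Sequential: Beta({a}, {b}), mean = {sequential.mean():.4f}")
print(f"Pooled:     Beta(22, 160), mean = {pooled.mean():.4f}")
```

Order does not matter and nothing is lost by updating incrementally, which is what makes Bayesian inference a natural fit for streaming data.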
The Deeper Lesson
The mathematician John Allen Paulos, reviewing McGrayne's book in the New York Times Book Review, offered a summation that doubles as advice for Python practitioners: "If you're not thinking like a Bayesian, perhaps you should be." The Bayesian framework is not merely a statistical technique. It is a discipline of intellectual honesty — a commitment to stating your assumptions (the prior), updating them with evidence (the likelihood), and quantifying exactly how uncertain you remain (the posterior).
In a world saturated with point estimates and false precision — where models output confident predictions without any indication of how much they do not know — Bayesian thinking is the antidote. Not because it is always right, but because it is always honest about what it does not know. And in code, as in science, that honesty is where real understanding begins.
All code examples tested with Python 3.11, NumPy 1.26+, SciPy 1.12+, scikit-learn 1.4+, and PyMC 5.27+. ArviZ 0.19+ recommended for posterior diagnostics.