Causal Inference in Python: A Deep Dive into DoWhy, EconML, and CausalML

Your machine learning model can predict churn. But can it tell you why customers leave — and what to do about it? That question sits at the heart of a gap that has plagued data science for years: the gap between prediction and decision-making.

Traditional ML excels at finding patterns — correlations — but correlation, as every introductory statistics course reminds us, is not causation. Judea Pearl, the Turing Award-winning computer scientist whose work laid the theoretical foundations for modern causal inference, put it bluntly in The Book of Why (2018): "Data are profoundly dumb." Data can tell you that people who took a medicine recovered faster, but they cannot tell you whether the medicine actually caused the recovery.

Three Python libraries have emerged as the leading open-source tools for bridging this gap: DoWhy (PyWhy/Microsoft), EconML (Microsoft ALICE), and CausalML (Uber). Each occupies a distinct niche in the causal inference pipeline, and understanding when and how to use each one is rapidly becoming essential knowledge for data scientists working on anything beyond pure prediction.

This article breaks down what each library does, how they relate to one another, and — because this is Python CodeCrack — how to actually write code that moves you from correlational hand-waving to defensible causal claims.


The Theoretical Foundation: Why Causal Inference Needs Its Own Libraries

Before touching any code, it helps to understand why scikit-learn cannot answer causal questions. Supervised learning optimizes for Ŷ ≈ Y — it finds the best predictor of an outcome given features. But a causal question asks something different: What would happen to Y if we changed X? That "if we changed" is an intervention, and interventions live on what Pearl calls the second rung of the "Ladder of Causation."

The philosopher Nancy Cartwright captured this constraint in a phrase that has become a motto in the causal inference community: "No causes in, no causes out." You cannot extract causal conclusions from data alone; you need to bring causal assumptions to the table. The three libraries we are examining each handle this requirement differently, but all of them force you to be explicit about what you are assuming and why.

"Data tells stories. My research aims to tell the causal story." — Amit Sharma, Principal Researcher, Microsoft Research India

That philosophy — making the causal story explicit and testable — is baked into every layer of these tools.


DoWhy: The Four-Step Causal Inference Pipeline

DoWhy was created by Amit Sharma and Emre Kiciman at Microsoft Research and first released in 2018. In 2022, Microsoft migrated DoWhy to the independent PyWhy GitHub organization, with Amazon Web Services joining as a collaborator to contribute structural causal model technology. The library's latest stable release is v0.14 (November 2025), and it supports Python 3.9 through 3.13 under the MIT license.

The core design principle of DoWhy is its four-step pipeline: Model, Identify, Estimate, Refute. As Sharma and Kiciman wrote in their 2020 arXiv paper ("DoWhy: An End-to-End Library for Causal Inference"), the focus on all four steps — going from raw data to a final causal estimate along with a measure of its robustness — is the key differentiator compared to libraries that focus only on estimation.

Let us walk through each step with real code.

Step 1: Model

You begin by encoding your domain knowledge as a causal graph. This is where "no causes in, no causes out" becomes concrete. You must specify which variables you believe cause which other variables.

import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np

# Simulate a dataset: does a loyalty program increase spending?
np.random.seed(42)
n = 5000
age = np.random.normal(40, 10, n)
income = np.random.normal(50000, 15000, n)

# Treatment assignment is confounded by income (wealthier customers
# are more likely to join AND to spend more)
propensity = 1 / (1 + np.exp(-(income - 50000) / 10000))
loyalty_member = np.random.binomial(1, propensity)

# True causal effect of loyalty program: $200 increase in spending
spending = 500 + 0.01 * income + 5 * age + 200 * loyalty_member + np.random.normal(0, 100, n)

df = pd.DataFrame({
    "age": age,
    "income": income,
    "loyalty_member": loyalty_member,
    "spending": spending
})

# Encode causal assumptions as a graph
model = CausalModel(
    data=df,
    treatment="loyalty_member",
    outcome="spending",
    common_causes=["income", "age"]  # confounders we know about
)

The common_causes parameter is doing critical conceptual work here. You are telling DoWhy: "Income and age both influence whether someone joins the loyalty program and how much they spend." If you omit a real confounder, your estimate will be biased. If you include a collider variable, you can introduce bias where none existed. The graph is not decoration; it is the engine of your analysis.
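The collider warning is easy to demonstrate with a quick simulation. In the illustrative sketch below (the talent/looks/fame variables are hypothetical and have nothing to do with the loyalty dataset), two genuinely independent variables become strongly correlated the moment we condition on their common effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
talent = rng.normal(size=n)   # independent of looks by construction
looks = rng.normal(size=n)

# "Fame" is a collider: it is caused by both talent and looks
fame = talent + looks + rng.normal(scale=0.5, size=n)

# Unconditionally, talent and looks are uncorrelated
r_all = np.corrcoef(talent, looks)[0, 1]

# Conditioning on the collider (analyzing only the famous top decile)
famous = fame > np.quantile(fame, 0.9)
r_famous = np.corrcoef(talent[famous], looks[famous])[0, 1]

print(f"corr(talent, looks), everyone: {r_all:+.3f}")   # ≈ 0
print(f"corr(talent, looks), famous:   {r_famous:+.3f}")  # strongly negative
```

Among the famous, talent and looks appear strongly negatively related — a selection artifact (Berkson's paradox), not a causal fact. Adding `fame` to `common_causes` would inject exactly this bias into your estimate.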

Step 2: Identify

DoWhy now uses your graph to determine whether the causal effect can be estimated from the data, and which statistical strategy (called an "estimand") to use.

identified_estimand = model.identify_effect()
print(identified_estimand)

This will output something like:

Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
    d
   ────(E[spending|income,age])
   d[loyalty_member]

DoWhy has automatically applied the backdoor criterion from Pearl's do-calculus. It determined that conditioning on income and age is sufficient to block all confounding paths between the treatment and the outcome. This step catches a class of mistakes that are invisible if you jump straight to estimation: if the causal effect is not identifiable from your graph, DoWhy will tell you before you waste time fitting models to meaningless numbers.

Step 3: Estimate

With an identified estimand in hand, you choose a statistical method to compute the actual number.

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression"
)
print(f"Estimated causal effect: {estimate.value:.2f}")
# Output: Estimated causal effect: ~200.xx (close to the true effect of 200)

DoWhy supports multiple estimation methods — propensity score stratification, matching, instrumental variables, and more. Crucially, it also integrates directly with EconML and CausalML estimators for the more advanced methods we will cover below.

Step 4: Refute

This is the step that many analysts skip — and the step that makes DoWhy genuinely valuable. The refutation API stress-tests your estimate against violations of your assumptions.

# Placebo treatment: replace treatment with random noise
placebo = model.refute_estimate(
    identified_estimand, estimate,
    method_name="placebo_treatment_refuter"
)
print(placebo)

# Add a random common cause: does an unrelated variable change the estimate?
random_confounder = model.refute_estimate(
    identified_estimand, estimate,
    method_name="random_common_cause"
)
print(random_confounder)

# Data subset refuter: is the estimate stable across subsets?
subset = model.refute_estimate(
    identified_estimand, estimate,
    method_name="data_subset_refuter",
    subset_fraction=0.8
)
print(subset)

The placebo test replaces the real treatment with random noise and checks whether the estimated effect drops to zero (it should). The random common cause test adds an irrelevant variable and checks whether the estimate changes (it should not). If any refutation test fails, you have a problem — either with your model, your data, or your estimation method. This is the kind of sanity check that separates rigorous causal analysis from glorified correlation hunting.
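The logic behind the placebo refuter can be sketched without DoWhy at all. The toy example below (a fresh simulation, with plain OLS adjustment standing in for whatever estimator you actually used) shuffles the treatment column and checks that the estimated effect collapses toward zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 5000
income = rng.normal(50_000, 15_000, n)
T = rng.binomial(1, 1 / (1 + np.exp(-(income - 50_000) / 10_000)))
Y = 500 + 0.01 * income + 200 * T + rng.normal(0, 100, n)

def estimate(treatment):
    """Effect of `treatment` on Y, adjusting for the income confounder."""
    features = np.column_stack([treatment, income])
    return LinearRegression().fit(features, Y).coef_[0]

real = estimate(T)                       # should recover ~200
placebo = estimate(rng.permutation(T))   # shuffled treatment: should be ~0
print(f"real treatment: {real:.1f}, placebo treatment: {placebo:.1f}")
```

If the "effect" of a randomly permuted treatment does not vanish, something in the pipeline — leakage, a coding error, an unblocked path — is manufacturing the result.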

Pro Tip

Always run at least the placebo and random common cause refuters before presenting results to stakeholders. They take only seconds and can save you from defending a spurious finding in a meeting.


EconML: Heterogeneous Treatment Effects at Scale

If DoWhy answers "does this treatment have an effect?", EconML answers a more nuanced question: "how does the effect vary across different people?"

EconML was developed by the ALICE (Automated Learning and Intelligence for Causation and Economics) team at Microsoft Research. The project's goal, as stated on its GitHub page, is "to combine state-of-the-art machine learning techniques with econometrics to bring automation to complex causal inference problems." Its latest release is v0.16.0 (July 2025).

The key concept is heterogeneous treatment effects (HTE): the causal effect of a treatment is not one number, but a function of observable features. A discount might boost spending by $50 for college students but by $500 for retirees. With an even split between the groups, the Average Treatment Effect (ATE) is $275 — a number that describes nobody and hides the critical difference. EconML surfaces it.
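The distinction is easy to see numerically. This toy sketch uses the hypothetical student/retiree figures from the paragraph above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
is_retiree = rng.binomial(1, 0.5, n).astype(bool)

# Hypothetical individual effects: $500 for retirees, $50 for students
effect = np.where(is_retiree, 500.0, 50.0)

print(f"ATE:            ${effect.mean():.0f}")                # ≈ 275
print(f"CATE | retiree: ${effect[is_retiree].mean():.0f}")    # 500
print(f"CATE | student: ${effect[~is_retiree].mean():.0f}")   # 50
```

A targeting policy built on the ATE would treat (or skip) everyone identically; one built on the conditional effects (CATEs) can concentrate spend where it pays off.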

Double Machine Learning (DML)

The workhorse method in EconML is Double Machine Learning, based on the foundational work of Victor Chernozhukov and colleagues (published in The Econometrics Journal, 2018). DML uses a clever two-stage approach: first, use ML to partial out the effects of confounders on both the treatment and the outcome; then, estimate the causal effect from the residuals.

from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# X = features that modify the treatment effect (effect heterogeneity)
# W = confounders to control for
# T = treatment, Y = outcome
X = df[["age"]].values          # effect modifier
W = df[["income"]].values       # confounder
T = df["loyalty_member"].values # treatment
Y = df["spending"].values       # outcome

est = LinearDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingClassifier(),
    random_state=42
)
est.fit(Y, T, X=X, W=W)

# Get the treatment effect as a function of age
treatment_effects = est.effect(X)
print(f"Mean effect: {treatment_effects.mean():.2f}")
print(f"Std of effects: {treatment_effects.std():.2f}")

# Confidence intervals
lb, ub = est.effect_interval(X, alpha=0.05)

The model_y and model_t parameters are where machine learning enters the picture. You can plug in any sklearn-compatible estimator — random forests, gradient boosting, neural networks — to flexibly model the nuisance relationships (confounders to outcome, confounders to treatment). EconML handles the orthogonalization that ensures your causal estimate remains valid even if those nuisance models are imperfect. This is the "double" in Double Machine Learning: two ML models handle the prediction, while the causal estimate is protected by the orthogonal construction.
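To demystify the two stages, here is a from-scratch sketch of the residual-on-residual idea using plain scikit-learn. This is a simplified illustration with one binary treatment and one confounder — not EconML's implementation, which adds fold management, effect heterogeneity, and valid confidence intervals:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
n = 5000
income = rng.normal(50_000, 15_000, n)
p = 1 / (1 + np.exp(-(income - 50_000) / 10_000))
T = rng.binomial(1, p)                                  # confounded treatment
Y = 500 + 0.01 * income + 200 * T + rng.normal(0, 100, n)
W = income.reshape(-1, 1)                               # confounder matrix

# Stage 1: cross-fitted ML predictions of Y and T from the confounders
y_hat = cross_val_predict(GradientBoostingRegressor(), W, Y, cv=5)
t_hat = cross_val_predict(GradientBoostingClassifier(), W, T, cv=5,
                          method="predict_proba")[:, 1]

# Stage 2: regress outcome residuals on treatment residuals
y_res, t_res = Y - y_hat, T - t_hat
theta = (t_res @ y_res) / (t_res @ t_res)
print(f"DML estimate of the effect: {theta:.1f}")  # close to the true 200
```

Cross-fitting (each observation's nuisance prediction comes from a model that never saw it) is what keeps the Stage 2 estimate honest despite the flexible ML models in Stage 1.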

Causal Forests

EconML also implements Causal Forests, based on the seminal 2018 paper by Stefan Wager and Susan Athey published in the Journal of the American Statistical Association. This method adapts Breiman's random forest algorithm specifically for treatment effect estimation, producing individual-level effect estimates with valid confidence intervals.

from econml.dml import CausalForestDML

cf = CausalForestDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingClassifier(),
    n_estimators=200,
    random_state=42
)
cf.fit(Y, T, X=X, W=W)

# Individual treatment effects
ite = cf.effect(X)

# Feature importance for treatment effect heterogeneity
importances = cf.feature_importances_

Wager and Athey's theoretical contribution was proving that causal forests are pointwise consistent for the true treatment effect under unconfoundedness assumptions — meaning they converge to the correct answer as sample size grows. EconML makes this theory accessible through a familiar sklearn-style API.

Integration with DoWhy

One of the best features of the PyWhy ecosystem is that DoWhy and EconML integrate seamlessly. You can use DoWhy for Steps 1, 2, and 4 (Model, Identify, Refute) while plugging in EconML estimators for Step 3 (Estimate):

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.LinearDML",
    method_params={
        "init_params": {
            "model_y": GradientBoostingRegressor(),
            "model_t": GradientBoostingClassifier()
        },
        "fit_params": {}
    }
)

This gives you the best of both worlds: DoWhy's rigorous assumption modeling and refutation testing, combined with EconML's state-of-the-art heterogeneous effect estimators.


CausalML: Uplift Modeling and Individual Treatment Effects

CausalML, developed at Uber and first published as an arXiv whitepaper in 2020 by Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao, occupies a slightly different niche. While EconML has its roots in econometrics and academic research, CausalML grew directly from Uber's production needs: deciding which customers to target with promotions, which drivers to incentivize, and which product interventions actually work.

The library's primary focus is uplift modeling — estimating the incremental effect of a treatment on an individual, specifically to optimize who receives the treatment. As the CausalML documentation states, the package "estimates the causal impact of intervention T on outcome Y for users with observed features X, without strong assumptions on the model form."
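The downstream use of those estimates is a targeting policy: treat a user only when their predicted incremental value exceeds the cost of treating them. A schematic sketch — the uplift scores and $10 cost below are randomly generated stand-ins for the output of any CATE estimator, not CausalML API calls:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
uplift = rng.normal(20, 40, n)   # hypothetical incremental $ per treated user
cost = 10.0                      # hypothetical cost of treating one user

# Target only users whose predicted uplift exceeds the treatment cost
target = uplift > cost
profit_all = (uplift - cost).sum()                # blanket treatment
profit_targeted = (uplift[target] - cost).sum()   # uplift-based targeting

print(f"treat everyone: ${profit_all:,.0f}")
print(f"treat top uplift only ({target.mean():.0%} of users): "
      f"${profit_targeted:,.0f}")
```

Excluding users with low or negative predicted uplift strictly improves the net return — the entire economic argument for uplift modeling in one inequality.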

CausalML's current release is v0.16.0, and it is licensed under Apache 2.0.

Meta-Learners

CausalML's signature contribution is its comprehensive implementation of meta-learner algorithms. These are strategies for repurposing any supervised learning algorithm into a treatment effect estimator. CausalML provides four primary meta-learners:

from causalml.inference.meta import BaseSRegressor, BaseTRegressor
from causalml.inference.meta import BaseXRegressor, BaseRRegressor
from sklearn.ensemble import GradientBoostingRegressor

# S-Learner: single model, treatment as a feature
s_learner = BaseSRegressor(learner=GradientBoostingRegressor())
s_ate = s_learner.estimate_ate(X=df[["age", "income"]], treatment=df["loyalty_member"], y=df["spending"])

# T-Learner: separate models for treated and control groups
t_learner = BaseTRegressor(learner=GradientBoostingRegressor())
t_ate = t_learner.estimate_ate(X=df[["age", "income"]], treatment=df["loyalty_member"], y=df["spending"])

# X-Learner: cross-fits imputed treatment effects (Kunzel et al., 2019)
x_learner = BaseXRegressor(learner=GradientBoostingRegressor())
x_ite = x_learner.fit_predict(X=df[["age", "income"]], treatment=df["loyalty_member"], y=df["spending"])

# R-Learner: residual-on-residual regression (Robinson, 1988; Nie & Wager, 2021)
r_learner = BaseRRegressor(learner=GradientBoostingRegressor())
r_ite = r_learner.fit_predict(X=df[["age", "income"]], treatment=df["loyalty_member"], y=df["spending"])

Understanding when to use each meta-learner matters. The S-Learner is simple but can underestimate treatment effects because a single model might not allocate enough capacity to the treatment variable. The T-Learner avoids this by fitting separate models but cannot share information between groups, leading to high variance with small treatment groups. The X-Learner, proposed by Künzel, Sekhon, Bickel, and Yu in their 2019 paper in the Proceedings of the National Academy of Sciences, was specifically designed for imbalanced treatment and control groups — common in real-world observational data. The R-Learner, rooted in Robinson's 1988 partial residual approach and formalized by Nie and Wager (Biometrika, 2021), is closest in spirit to EconML's DML approach and tends to be the most robust when confounding is present.
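The S/T distinction is clearest when written out by hand. This sketch implements both learners directly with scikit-learn on a fresh simulated dataset with a randomized treatment (so there is no confounding to worry about); it is an illustration of the strategies, not CausalML's internals:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 4000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, n)
tau = 100 + 50 * X[:, 0]                 # true heterogeneous effect
Y = 300 + 20 * X[:, 1] + tau * T + rng.normal(0, 30, n)

# S-Learner: one model, with the treatment appended as just another feature
s = RandomForestRegressor(random_state=0).fit(np.column_stack([X, T]), Y)
tau_s = (s.predict(np.column_stack([X, np.ones(n)]))
         - s.predict(np.column_stack([X, np.zeros(n)])))

# T-Learner: one model per arm; the effect is the difference in predictions
m1 = RandomForestRegressor(random_state=0).fit(X[T == 1], Y[T == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[T == 0], Y[T == 0])
tau_t = m1.predict(X) - m0.predict(X)

print(f"true ATE: {tau.mean():.0f}, "
      f"S-Learner: {tau_s.mean():.0f}, T-Learner: {tau_t.mean():.0f}")
```

With a large, balanced experiment both do fine; the trade-offs described above bite when the treated group is small or the treatment signal is weak relative to the outcome's other drivers.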

Uplift Trees and Random Forests

CausalML's second major contribution is its implementation of uplift-specific tree algorithms. Unlike standard decision trees that split to maximize prediction accuracy, uplift trees split to maximize the difference in treatment effect between child nodes.

from causalml.inference.tree import UpliftTreeClassifier, UpliftRandomForestClassifier

uplift_model = UpliftTreeClassifier(
    max_depth=5,
    min_samples_leaf=200,
    min_samples_treatment=50,
    n_reg=100,
    evaluationFunction="KL",  # Kullback-Leibler divergence
    control_name="control"
)

# Requires treatment column as string labels
treatment_labels = np.where(df["loyalty_member"] == 1, "treatment", "control")
conversion = (df["spending"] > df["spending"].median()).astype(int)

uplift_model.fit(
    df[["age", "income"]].values,
    treatment=treatment_labels,
    y=conversion.values
)

The evaluationFunction parameter controls how the tree decides where to split. Options include Kullback-Leibler divergence (KL), Euclidean distance (ED), and Chi-squared (Chi). In Uber's production systems, these uplift models were used to determine which users should receive promotions. As reported in the CausalML whitepaper, an internal analysis at Uber showed that targeting only 30% of users with uplift modeling could achieve the same increase in conversion as offering promotions to all users — a substantial efficiency gain.
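The KL criterion itself is simple to write down. In this simplified sketch (single split, binary outcome; CausalML's production implementation adds normalization, regularization, and multi-treatment support), a node is scored by the divergence between its treatment and control conversion rates, and a candidate split's gain is the weighted improvement in that score:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-6):
    """KL divergence between two Bernoulli distributions with rates p and q."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def split_gain(parent, left, right):
    """Gain of a split; each node is (treat_rate, control_rate, n_samples)."""
    def node_score(node):
        treat_rate, control_rate, _ = node
        return kl_divergence(treat_rate, control_rate)
    n_l, n_r = left[2], right[2]
    weighted_children = (n_l * node_score(left) + n_r * node_score(right)) / (n_l + n_r)
    return weighted_children - node_score(parent)

# A split that separates high-uplift users from no-uplift users scores well
parent = (0.15, 0.10, 1000)   # modest overall uplift
left   = (0.25, 0.10, 500)    # strong positive uplift
right  = (0.05, 0.10, 500)    # negative uplift
print(f"split gain: {split_gain(parent, left, right):.4f}")
```

A standard accuracy-driven tree would see little reason to make this split; an uplift tree sees it as exactly the segmentation the targeting problem needs.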

Feature Selection for Uplift

CausalML also includes specialized methods for feature selection in the uplift context, based on work by Zhao, Zhang, Harinen, and Yung presented at the 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA):

from causalml.feature_selection.filters import FilterSelect

# FilterSelect operates on a single DataFrame with labeled
# treatment and outcome columns (argument names per CausalML's
# filter-method documentation)
fs_data = df[["age", "income"]].copy()
fs_data["treatment_group_key"] = treatment_labels
fs_data["conversion"] = conversion.values

filter_select = FilterSelect()
importance = filter_select.get_importance(
    fs_data,
    features=["age", "income"],
    y_name="conversion",
    method="KL",
    experiment_group_column="treatment_group_key",
    control_group="control",
    treatment_group="treatment"
)

Related PEPs and the Python Ecosystem

While there are no PEPs specifically targeting causal inference, several PEPs have shaped the ecosystem these libraries depend on.

PEP 450 (accepted, Python 3.4) added the statistics module to the standard library, motivated by the "batteries included" philosophy. While its functions (mean, median, variance, standard deviation) are basic compared to what DoWhy and EconML compute, PEP 450 established the principle that statistical reasoning belongs in Python's core. The PEP's author argued that installing third-party packages to average a list of numbers was an unreasonable barrier, and that philosophy of accessibility is reflected in how the causal inference libraries are designed: pip install dowhy gives you a complete pipeline.

PEP 484 (accepted, Python 3.5) introduced type hints, which all three libraries now leverage for improved IDE support and code documentation. EconML's API, for instance, uses type annotations extensively to clarify which parameters expect numpy arrays versus pandas DataFrames.

PEP 517/518 (accepted 2016–2017) modernized Python's build system with pyproject.toml, which CausalML adopted to manage its complex build process (the library includes Cython-compiled components for performance-critical uplift tree implementations). These are build system specifications rather than language-version features, and they became broadly adopted across the ecosystem over several years following their acceptance.

PEP 561 (accepted, Python 3.7) established a standard for distributing type information for packages, enabling stub packages like data-science-types to provide type annotations for NumPy, pandas, and matplotlib — the foundational dependencies that all three causal inference libraries build upon.

PEP 734 (accepted June 2025, Python 3.14) adds the concurrent.interpreters stdlib module, exposing Python's existing support for multiple independent subinterpreters in a single process. The per-interpreter GIL itself was introduced earlier by PEP 684 (Python 3.12); PEP 734 makes that capability accessible from ordinary Python code for the first time. While still early in adoption, this has direct implications for causal inference workloads. Methods like causal forests and bootstrap refutation tests are embarrassingly parallel, and the ability to run multiple interpreters simultaneously without contention could significantly speed up the computationally expensive estimation and refutation phases.
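Even before subinterpreters are broadly usable, the bootstrap pattern itself parallelizes cleanly with the standard library. The sketch below uses a thread pool for portability (NumPy releases the GIL during many array operations; under PEP 734, the same map-over-seeds structure could run on truly independent interpreters):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def bootstrap_effect(seed, treated, control):
    """One bootstrap replicate of a difference-in-means effect estimate."""
    rng = np.random.default_rng(seed)
    t = rng.choice(treated, size=treated.size, replace=True)
    c = rng.choice(control, size=control.size, replace=True)
    return t.mean() - c.mean()

rng = np.random.default_rng(0)
treated = rng.normal(700, 100, 2000)   # e.g., spending of loyalty members
control = rng.normal(500, 100, 2000)

# Each replicate is independent: an embarrassingly parallel workload
with ThreadPoolExecutor() as pool:
    reps = list(pool.map(lambda s: bootstrap_effect(s, treated, control),
                         range(500)))

lo, hi = np.percentile(reps, [2.5, 97.5])
print(f"95% bootstrap CI for the effect: [{lo:.0f}, {hi:.0f}]")
```

Because each replicate only needs read access to the two arrays and a seed, the work divides across interpreters (or processes) with essentially no coordination overhead.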

Note

DoWhy already supports joblib-based parallel processing for many refutation tests. Pass n_jobs=-1 in the refutation method's parameters to take advantage of all available CPU cores while PEP 734 adoption matures.


Choosing the Right Tool

The three libraries are complementary, not competing. Here is how to think about them:

Use DoWhy when you need a rigorous, end-to-end pipeline that forces you to state your assumptions and test them. DoWhy is the "orchestrator" — it handles the modeling, identification, and refutation steps that the other two libraries do not address. If you are presenting causal results to stakeholders, DoWhy's refutation tests are your defense.

Use EconML when you need to estimate heterogeneous treatment effects from observational data, especially when the relationships between confounders, treatment, and outcome are complex and nonlinear. EconML's Double Machine Learning and Causal Forest implementations are among the best available in any language, with valid confidence intervals backed by published statistical theory.

Use CausalML when your primary goal is uplift modeling — figuring out who to treat, not just measuring the average effect. CausalML's meta-learner suite and uplift tree implementations were designed for production targeting systems. If you are optimizing a marketing campaign, pricing strategy, or recommendation engine, CausalML's tools are purpose-built for that workflow.

In practice, many teams use all three together. DoWhy structures the analysis and validates assumptions. EconML or CausalML provides the estimation engine. DoWhy's refutation API tests the final result.

# A complete pipeline using all three libraries
from dowhy import CausalModel
from econml.dml import LinearDML
from causalml.inference.meta import BaseXRegressor
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Assumes df is defined as in the earlier examples
Y = df["spending"].values
T = df["loyalty_member"].values
X = df[["age"]].values
W = df[["income"]].values

# Step 1-2: DoWhy for modeling and identification
model = CausalModel(data=df, treatment="loyalty_member",
                     outcome="spending", common_causes=["income", "age"])
estimand = model.identify_effect()

# Step 3a: EconML for heterogeneous effects with confidence intervals
dml = LinearDML(model_y=GradientBoostingRegressor(),
                model_t=GradientBoostingClassifier())
dml.fit(Y, T, X=X, W=W)
effects_with_ci = dml.effect_interval(X)

# Step 3b: CausalML for uplift-based targeting
x_learner = BaseXRegressor(learner=GradientBoostingRegressor())
individual_effects = x_learner.fit_predict(
    X=df[["age", "income"]], treatment=df["loyalty_member"], y=df["spending"]
)

# Step 4: DoWhy for refutation
refutation = model.refute_estimate(estimand,
    model.estimate_effect(estimand, method_name="backdoor.linear_regression"),
    method_name="random_common_cause")

The Road Ahead

The causal inference ecosystem in Python is maturing rapidly. The PyWhy organization, which now hosts both DoWhy and EconML, has expanded to include causal discovery tools (causal-learn), statistical testing utilities (pywhy-stats), and an AutoML wrapper for causal methods (causaltune). DoWhy's 2024 JMLR paper on the GCM extension (by Blöbaum, Götz, Budhathoki, Mastakouri, and Janzing) added graphical causal model capabilities for root cause analysis and causal attribution — functionality that goes well beyond traditional effect estimation.

Meanwhile, the intersection of causal inference and large language models is opening entirely new frontiers. Sharma and Kiciman's 2023 paper, "Causal Reasoning and Large Language Models: Opening a New Frontier for Causality," demonstrated that GPT-4 can perform meaningful causal reasoning tasks — suggesting a future where LLMs help practitioners build causal graphs from domain knowledge expressed in natural language.

For Python developers, the message is clear: causal inference is not a niche academic topic. It is rapidly becoming a core competency for anyone building systems that make decisions rather than just predictions. These three libraries give you the tools. The understanding of when and why to use them — that is on you.


All code examples in this article were tested against DoWhy 0.14, EconML 0.16.0, and CausalML 0.16.0 running on Python 3.11.
