Causal Discovery in Python: gCastle and CausalNex from the Ground Up

Your machine learning model found a correlation. Great. But does variable A actually cause variable B, or are they both responding to some lurking confounder you never measured?

This is not a philosophical exercise — it is the difference between a model that survives deployment and one that collapses the moment conditions shift. Causal discovery is the field dedicated to answering that question algorithmically: given observational data, can we learn the directed acyclic graph (DAG) that encodes the true cause-and-effect relationships among variables? Two Python libraries have made this dramatically more accessible to working practitioners: gCastle, developed by Huawei Noah's Ark Lab, and CausalNex, created by QuantumBlack (McKinsey's AI arm). This article breaks down what they do, how they differ, and — because this is pythonCodeCrack — how to write real code that actually works with them.

Why Causal Discovery Matters Now

Judea Pearl, the Turing Award-winning computer scientist who formalized modern causal inference, put it bluntly in The Book of Why (2018): statistics has long told us that correlation is not causation, but it never told us what causation actually is. Pearl's do-calculus and structural causal models gave us the mathematical machinery. The problem was always computational: the space of possible DAGs grows super-exponentially with the number of variables. For just 10 variables, the number of possible DAGs exceeds 4 × 10^18 — a combinatorial space so vast that exhaustive search is completely intractable. Traditional constraint-based methods like the PC algorithm (Spirtes, Glymour, and Scheines, 2000) approach this through conditional independence tests, pruning edges one at a time. Score-based methods like GES (Greedy Equivalence Search) walk through equivalence classes. Both have limitations in scale.
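That growth rate is easy to verify for yourself. Robinson's recurrence counts the labeled DAGs on n nodes, and a few lines of standard-library Python (not part of either library) reproduce the figure above:

```python
from math import comb

def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    a = [1]  # a[0] = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # Inclusion-exclusion over a set of k "source" nodes; each of
            # the other m - k nodes may or may not receive an edge from
            # each source, giving 2^(k * (m - k)) possibilities.
            total += (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]

print(count_dags(3))   # 25
print(count_dags(10))  # 4175098976430598143, roughly 4.2 x 10^18
```

Ten variables already put exhaustive enumeration out of reach; at twenty, the count dwarfs the number of atoms in the observable universe.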

The game changed in 2018. Xun Zheng and colleagues at Carnegie Mellon published "DAGs with NO TEARS" at NeurIPS, reformulating structure learning as a continuous optimization problem. Instead of searching a combinatorial space, NOTEARS defines a smooth constraint function h(W) = tr(e^{W ○ W}) - d = 0 whose zero set corresponds exactly to acyclic graphs, then solves the whole thing with standard gradient-based optimization. As the CMU Machine Learning Blog described it, the core optimization logic of the algorithm is remarkably compact. This breakthrough — turning a super-exponential search into something that scales cubically per iteration — is the engine underneath both gCastle and CausalNex.
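To see how compact that constraint really is, here is a minimal sketch of h(W) in NumPy and SciPy. This is an illustration of the math, not code from gCastle or the NOTEARS reference implementation:

```python
import numpy as np
from scipy.linalg import expm

def notears_h(W):
    """NOTEARS acyclicity measure: h(W) = tr(e^{W o W}) - d.
    Zero exactly when the weighted adjacency matrix W encodes a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the Hadamard product

acyclic = np.array([[0.0, 1.5],
                    [0.0, 0.0]])   # single edge 0 -> 1: a DAG
cyclic = np.array([[0.0, 1.5],
                   [0.7, 0.0]])    # 0 -> 1 and 1 -> 0: a cycle

print(notears_h(acyclic))  # 0.0 (up to floating point)
print(notears_h(cyclic))   # positive, and it grows with the cycle weights
```

Because h is smooth in W, its gradient is available to any off-the-shelf optimizer, which is the entire trick: acyclicity becomes just another differentiable constraint.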

Bernhard Schölkopf of the Max Planck Institute argued in his influential 2019 paper "Causality for Machine Learning" that the hardest open problems in ML and AI are intrinsically linked to causality. Transfer learning, domain adaptation, robustness to distribution shift — all of these, he contended, require understanding causal structure rather than just fitting statistical associations. That thesis has only gained momentum since.

gCastle: Huawei's Causal Structure Learning Toolchain

gCastle (short for gradient-based Causal Structure LEarning) was released in 2021 by Huawei Noah's Ark Lab, with the research led by Keli Zhang, Shengyu Zhu, Marcus Kalander, Ignavier Ng, and colleagues. The project sits within Huawei's broader trustworthyAI initiative, and the team published their technical paper on arXiv (2111.15155) describing the design philosophy.

The library's distinguishing feature is its focus on gradient-based methods. While it includes classic algorithms (PC, GES, DirectLiNGAM), its core strength is providing GPU-accelerated implementations of modern continuous optimization approaches that can handle hundreds of variables — a scale where traditional methods break down.

What gCastle Includes

The algorithm catalog spans three major categories of causal discovery.

Constraint-based methods work through conditional independence testing. The PC algorithm is the canonical example: it starts from a fully connected graph and removes edges whenever a conditional independence is detected, iterating until no further removals are possible.

Score-based methods optimize a score function (like BIC) over graph structures. GES is the classical representative, conducting a two-phase greedy search — first adding edges, then removing them — to maximize a decomposable score.

Gradient-based methods are where gCastle really differentiates itself. The library implements NOTEARS (linear), NOTEARS-Nonlinear (with MLP and Sobolev space variants), NOTEARS-Low-Rank (for large sparse graphs), GOLEM (a more efficient NOTEARS variant that reduces optimization iterations), DAG-GNN (using graph neural networks), and reinforcement learning approaches like RL-BIC. It also includes function-based methods such as DirectLiNGAM and ICALiNGAM for linear non-Gaussian models, and ANM (Additive Noise Models) for nonlinear cases.

Real Code: Structure Learning with gCastle

Here is a complete, runnable example that simulates data from a known causal graph, learns the structure using three different algorithms, and evaluates how each one performed. This is not a toy snippet — it is the workflow you would actually use.

import numpy as np
from castle.common import GraphDAG
from castle.metrics import MetricsDAG
from castle.datasets import IIDSimulation, DAG
from castle.algorithms import PC, GES, Notears

# Step 1: Generate a ground-truth DAG and simulate data from it.
# This creates an Erdos-Renyi graph with 8 nodes and roughly 16 edges,
# then samples 2000 observations from a linear Gaussian structural
# equation model defined by that graph.
true_dag = DAG.erdos_renyi(n_nodes=8, n_edges=16, weight_range=(0.5, 2.0), seed=42)
dataset = IIDSimulation(W=true_dag, n=2000, method='linear', sem_type='gauss')
X = dataset.X  # shape: (2000, 8) -- our observational data
W_true = dataset.B  # the true binary adjacency matrix (ground truth)

# Step 2: Learn structure with the PC algorithm.
# PC is constraint-based: it starts fully connected and removes edges
# based on conditional independence tests. Fast, interpretable,
# but returns a CPDAG (some edges may be undirected).
pc = PC(variant='stable')
pc.learn(X)

# Step 3: Learn structure with GES.
# GES is score-based: it searches equivalence classes by greedily
# adding then removing edges to optimize the BIC score.
ges = GES(criterion='bic')
ges.learn(X)

# Step 4: Learn structure with NOTEARS.
# This is the gradient-based approach. It solves a continuous
# optimization problem: minimize least-squares loss subject to
# the acyclicity constraint h(W) = tr(e^{W circ W}) - d = 0.
notears = Notears(lambda1=0.1, loss_type='l2')
notears.learn(X)

# Step 5: Evaluate all three against ground truth.
# SHD (Structural Hamming Distance) counts edge additions, deletions,
# and reversals needed to match the true graph. Lower is better.
for name, algo in [('PC', pc), ('GES', ges), ('NOTEARS', notears)]:
    metrics = MetricsDAG(algo.causal_matrix, W_true)
    m = metrics.metrics
    print(f"\n{name}:")
    print(f"  SHD: {m['shd']:.0f}  |  F1: {m['F1']:.3f}  "
          f"|  TPR: {m['tpr']:.3f}  |  FDR: {m['fdr']:.3f}")

The MetricsDAG class gives you the essential evaluation suite: Structural Hamming Distance (SHD), F1 score, true positive rate (TPR), and false discovery rate (FDR). These are the standard metrics in the causal discovery literature — you will see them in every benchmarking paper.
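To demystify what SHD actually counts, here is a minimal hand-rolled version in plain NumPy. It is independent of gCastle's implementation, which may differ in edge-case conventions:

```python
import numpy as np

def shd(B_est, B_true):
    """Structural Hamming Distance between binary adjacency matrices.
    One point per extra edge, missing edge, or reversed edge
    (a reversal counts once, not as a deletion plus an addition)."""
    skel_est = ((B_est + B_est.T) > 0).astype(int)
    skel_true = ((B_true + B_true.T) > 0).astype(int)
    # Edges present in one skeleton but not the other.
    extra_or_missing = int(np.triu(np.abs(skel_est - skel_true)).sum())
    # Edges in both skeletons but oriented differently.
    both = (skel_est * skel_true) > 0
    reversed_edges = int(np.triu(both & (B_est != B_true)).sum())
    return extra_or_missing + reversed_edges

# True graph 0 -> 1 -> 2; estimate reverses 1 -> 2 and adds 0 -> 2.
B_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
B_est = np.array([[0, 1, 1], [0, 0, 0], [0, 1, 0]])
print(shd(B_est, B_true))  # 2: one reversal + one extra edge
```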

When to Pick Which Algorithm

Understanding when each method excels is not optional knowledge — it determines whether your results mean anything.

PC works well when you have plenty of data relative to the number of variables and the underlying relationships are roughly linear. It scales reasonably and returns quickly. The catch: it gives you a CPDAG (completed partially directed acyclic graph), meaning some edges remain undirected. You know there is a relationship, but not which direction the causation flows.

NOTEARS excels when you need a fully directed graph and your data is reasonably well-behaved (linear relationships, Gaussian or sub-Gaussian noise). It handles larger variable counts than PC typically can, and GPU acceleration via PyTorch makes it practical for hundreds of nodes. However, NOTEARS has a documented sensitivity to variable scaling — a 2021 analysis by Reisach and colleagues demonstrated that rescaling variables can substantially change NOTEARS' output, a concern the causal discovery community now refers to as the "varsortability" problem. Always normalize or standardize your data before running NOTEARS.
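Standardizing is a few lines of NumPy (sklearn's StandardScaler works equally well); the sketch below also guards against zero-variance columns:

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance.
    Replaces zero standard deviations to avoid division by zero."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=15.0, size=(1000, 4))
X_std = standardize(X)
# Columns now have mean ~0 and std ~1, so marginal variances carry no
# ordering information for NOTEARS to exploit ("varsortability").
```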

DirectLiNGAM is the right choice when you can assume linear relationships with non-Gaussian noise. The non-Gaussianity assumption is what allows it to identify the full DAG (not just an equivalence class), since the asymmetry in non-Gaussian distributions breaks the directional symmetry that makes edge orientation unidentifiable in Gaussian models.

// tip: start with simulation

Before running any algorithm on real data, generate a synthetic dataset from a known DAG and see how well each algorithm recovers it. The IIDSimulation + MetricsDAG workflow above exists precisely for this. It calibrates your expectations and reveals which algorithms hold up under your data's specific conditions (sample size, noise type, variable count) before you commit to one approach on real data where ground truth is unknown.

CausalNex: McKinsey's Bayesian Network Approach

CausalNex was launched in January 2020 by QuantumBlack, McKinsey's AI division. It was conceived and originally prototyped by Paul Beaumont and Ben Horsburgh, who built it to address challenges they encountered doing causal inference on consulting projects. As Horsburgh explained in an interview published by McKinsey in February 2020, traditional ML models recognize correlations and patterns, but recognizing a pattern does not establish causation. He offered a concrete example: ice cream sales and droughts both spike in summer, but neither causes the other.

Where gCastle is algorithm-centric (offering many structure learning methods), CausalNex is workflow-centric. It provides an end-to-end pipeline that takes you from structure learning through Bayesian Network fitting to intervention analysis via do-calculus. The design philosophy, as described in QuantumBlack's introductory materials, is that building causality analysis from scratch typically requires multiple different open-source libraries before arriving at the final step of finding the right intervention. CausalNex consolidates that into a single coherent API.

The CausalNex Architecture

CausalNex is built around three core components.

Structure Learning. CausalNex uses a NOTEARS implementation to learn DAG structure from data. It later added DYNOTEARS, a variant published by QuantumBlack researcher Roxana Pamfil and colleagues at AISTATS 2020, which extends the framework to time-series data by learning both contemporaneous (within-time-slice) and lagged (across-time-slice) causal relationships.

Bayesian Network Fitting. Once you have a structure (whether learned from data, built from domain knowledge, or a blend of both), CausalNex fits conditional probability distributions to each node. A deliberate design choice here: CausalNex uses discrete probability distributions, requiring you to discretize continuous variables first. The QuantumBlack team explained their reasoning — continuous distributions rely too heavily on normality assumptions that real-world data rarely satisfies, and discretization, while losing some information, yields more robust models in practice.

Intervention Simulation via Do-Calculus. This is where CausalNex goes beyond structure learning. Pearl's do-calculus lets you compute P(Y | do(X)) — the probability of Y given that you intervene to set X to a particular value, as opposed to merely observing that X takes that value. The difference between P(Y | X) and P(Y | do(X)) is the mathematical core of causal reasoning. CausalNex wraps this in its InferenceEngine, letting you simulate "what if" scenarios.
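The gap between the two quantities is easy to see in a toy simulation. The sketch below is plain NumPy, not CausalNex code: a confounder Z drives both X and Y, so the observational contrast overstates the true effect of X, while simulating do(X) recovers it:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Toy structural causal model: Z -> X, Z -> Y, and X -> Y.
Z = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.2 + 0.6 * Z)            # Z strongly raises P(X=1)
Y = rng.binomial(1, 0.1 + 0.2 * X + 0.5 * Z)  # true direct effect of X: 0.2

# Observational contrast P(Y=1 | X=1) - P(Y=1 | X=0): confounded by Z,
# because conditioning on X also shifts the distribution of Z.
obs = Y[X == 1].mean() - Y[X == 0].mean()

# Interventional contrast P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)):
# setting X by fiat severs the Z -> X arrow, so Z stays at its
# population distribution while we resample Y with X pinned.
Y_do1 = rng.binomial(1, 0.1 + 0.2 * 1 + 0.5 * Z)
Y_do0 = rng.binomial(1, 0.1 + 0.2 * 0 + 0.5 * Z)
do = Y_do1.mean() - Y_do0.mean()

print(f"observational: {obs:.3f}")   # ~0.5, inflated by the confounder
print(f"interventional: {do:.3f}")   # ~0.2, the true causal effect
```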

Real Code: End-to-End CausalNex Workflow

import pandas as pd
import numpy as np
from causalnex.structure import StructureModel
from causalnex.structure.notears import from_pandas
from causalnex.network import BayesianNetwork
from causalnex.inference import InferenceEngine
from causalnex.discretiser import Discretiser

# Step 1: Generate some synthetic data with known causal structure.
# True DAG: marketing_spend -> website_visits -> signups -> revenue
# Also: brand_awareness -> website_visits (a second cause of visits)
np.random.seed(42)
n = 3000

marketing_spend = np.random.normal(50, 15, n)
brand_awareness = np.random.normal(60, 10, n)
website_visits = (
    0.8 * marketing_spend
    + 0.5 * brand_awareness
    + np.random.normal(0, 5, n)
)
signups = 0.3 * website_visits + np.random.normal(0, 3, n)
revenue = 2.5 * signups + np.random.normal(0, 10, n)

df = pd.DataFrame({
    'marketing_spend': marketing_spend,
    'brand_awareness': brand_awareness,
    'website_visits': website_visits,
    'signups': signups,
    'revenue': revenue,
})

# Step 2: Learn the causal structure using NOTEARS.
# from_pandas returns a StructureModel (a networkx DiGraph subclass).
# The threshold parameter controls sparsity: edges with absolute
# weight below this are dropped.
sm = from_pandas(df, tabu_edges=[], w_threshold=0.8)
print("Learned edges:", sm.edges())

# Step 3: Inspect and optionally edit the structure.
# This is where domain expertise enters. Maybe NOTEARS found a
# spurious edge. You can add or remove edges manually.
# sm.add_edge('marketing_spend', 'brand_awareness')  # if domain knowledge says so
# sm.remove_edge('signups', 'brand_awareness')  # if this seems wrong

# Step 4: Discretize continuous data.
# CausalNex requires discrete data for Bayesian Network fitting.
# Quantile-based discretization splits each feature into bins with
# roughly equal numbers of observations. The quantile split points
# must be learned from the data first, hence fit_transform rather
# than a bare transform.
df_disc = df.copy()
for col in df_disc.columns:
    df_disc[col] = Discretiser(
        method='quantile', num_buckets=5
    ).fit_transform(df_disc[col].values)

# Step 5: Fit the Bayesian Network.
bn = BayesianNetwork(sm)
bn = bn.fit_node_states(df_disc)
bn = bn.fit_cpds(df_disc, method='BayesianEstimator', bayes_prior='K2')

# Step 6: Run intervention analysis with do-calculus.
# Question: "What happens to revenue if we increase marketing spend
# to the highest bucket?"
ie = InferenceEngine(bn)

# Baseline: no intervention
# ie.query() returns a dict mapping each node to a dict of {state: probability}.
baseline = ie.query()['revenue'][4]  # probability of revenue in top quintile bucket
print(f"Baseline P(revenue=4): {baseline:.4f}")

# Intervention: do(marketing_spend = 4)  -- set to highest bucket
ie.do_intervention('marketing_spend', 4)
post_intervention = ie.query()['revenue'][4]
print(f"After do(marketing_spend=4), P(revenue=4): {post_intervention:.4f}")

# Reset interventions for the next analysis
ie.reset_do('marketing_spend')

That final step — the do-intervention — is the whole point. Observing that high marketing spend is correlated with high revenue does not tell you whether increasing the spend will cause revenue to go up (it might be that both are driven by seasonal effects, company size, or some other confounder). The do-operator strips away those confounding pathways and estimates the isolated causal effect.

// watch out: discretization bin labels

After quantile-based discretization with 5 buckets, each variable takes integer values 0–4. ie.query() (called with no arguments) returns a dict of dicts: the outer key is the node name, and the inner key is the bin label. So ie.query()['revenue'][4] retrieves the probability that revenue falls in the top quintile bucket — not that revenue equals 4 in the original scale. Passing a dict directly to query() as evidence is a separate pattern used for conditional queries, not marginal queries. Confusing these two call signatures is a common runtime error when first using CausalNex.

Related PEPs and Python Infrastructure

Causal discovery in Python does not exist in a vacuum — it depends on the scientific computing ecosystem that Python's community has deliberately cultivated through the PEP (Python Enhancement Proposal) process.

PEP 450 (Steven D'Aprano, 2013), which introduced the statistics module into the standard library starting in Python 3.4, established the principle that basic statistical computation belongs in Python's core batteries. D'Aprano argued that while NumPy is a full-featured library aimed at professionals, many users just need reliable mean and variance calculations. The statistics module does not compete with NumPy or SciPy — it is a foundation. But causal discovery libraries like gCastle and CausalNex build on top of those heavier tools: NumPy arrays are the native data format for gCastle's algorithms, and CausalNex leans on pandas DataFrames, which themselves wrap NumPy.

PEP 484 (Guido van Rossum, Jukka Lehtosalo, Lukasz Langa, 2015), which introduced type hints, is increasingly relevant to these libraries. Both gCastle and CausalNex accept NumPy arrays and pandas DataFrames as inputs, and type annotations make the API contracts explicit. As the ecosystem matures, typed interfaces reduce the "did I pass the right shape?" debugging that consumes so much time when chaining together data simulation, structure learning, and evaluation.

PEP 3107 (Collin Winter, Tony Lownds, 2006), the original function annotations PEP, laid the groundwork for PEP 484. Function annotations started as a flexible mechanism with no enforced semantics, but their adoption for type hints has made scientific Python code substantially more navigable — especially in libraries like causal-learn (CMU's causal discovery package) and the broader PyWhy ecosystem, which leverage typing to document complex function signatures involving graphs, matrices, and configuration objects.

PEP 465 (Nathaniel Smith, 2014), which added the @ operator for matrix multiplication in Python 3.5, matters more than it might seem. The NOTEARS algorithm's core computation involves matrix exponentials, Hadamard products, and weighted adjacency matrix operations. The @ operator lets you express W @ W cleanly instead of np.dot(W, W), making the mathematical intent of the code legible. When you are implementing or debugging a causal discovery algorithm, readability at the linear algebra level is not a luxury.
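For a concrete sense of what the operator buys, here is the linear-SEM least-squares loss at the heart of NOTEARS written both ways. This is a sketch of the math, not gCastle's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))      # n = 100 observations, d = 5 variables
W = 0.1 * rng.normal(size=(5, 5))  # candidate weighted adjacency matrix

# Linear-SEM least-squares loss: (1 / 2n) * ||X - X W||_F^2
n = X.shape[0]
loss_at = 0.5 / n * np.linalg.norm(X - X @ W, ord='fro') ** 2
loss_dot = 0.5 / n * np.linalg.norm(X - np.dot(X, W), ord='fro') ** 2
assert np.isclose(loss_at, loss_dot)  # same value, different notation

# With @, the gradient reads almost like the formula on the page:
# dL/dW = -(1/n) * X^T (X - X W)
grad = -(1.0 / n) * X.T @ (X - X @ W)
```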

Head-to-Head: gCastle vs. CausalNex

These libraries solve overlapping but different problems. Here is a practical comparison.

Algorithm breadth. gCastle wins decisively. It offers over a dozen structure learning algorithms spanning constraint-based, score-based, function-based, and gradient-based categories, plus reinforcement learning approaches. CausalNex primarily offers NOTEARS and DYNOTEARS.

End-to-end workflow. CausalNex wins here. It handles the full pipeline: structure learning, domain knowledge integration, Bayesian Network fitting, probability queries, and intervention simulation. gCastle focuses on structure learning and evaluation metrics; if you want do-calculus-based intervention analysis, you need to bring your own tools (like pgmpy or DoWhy).

Scale and performance. gCastle has explicit GPU acceleration through PyTorch for its gradient-based methods. CausalNex's NOTEARS uses SciPy optimization and does not natively support GPU. For large problems (hundreds of variables), gCastle is the more practical choice.

Time-series support. CausalNex includes DYNOTEARS for dynamic Bayesian networks. gCastle's primary focus is on IID data, though some of its algorithms can be adapted for temporal settings.

Maintenance status. Both libraries are open source under Apache 2.0. CausalNex's last significant release (v0.12.1) supports Python 3.8–3.10. gCastle (v1.0.4) continues to receive updates through the Huawei Noah's Ark Lab trustworthyAI repository. Neither is abandoned, but neither has the velocity of projects like the PyWhy ecosystem (DoWhy + causal-learn), which has broader community backing.

Practical Advice: What to Actually Do

If you are starting a causal discovery project today, here is a grounded recommendation based on what these tools actually do well.

Start with gCastle for exploration. Use its data simulation capabilities to build intuition. Generate known DAGs, simulate data from them, run multiple algorithms, and see how well each recovers the truth under different conditions (sample size, noise level, linearity assumptions). The MetricsDAG class makes comparison systematic. This is not busywork — it calibrates your expectations for when you apply these methods to real data where the ground truth is unknown.

Move to CausalNex when you need decisions. Once you have a candidate structure (whether learned from data, elicited from domain experts, or both), CausalNex's Bayesian Network fitting and do-calculus inference let you ask the questions that matter: "If we intervene on X, what happens to Y?" This is the bridge from academic causal discovery to operational decision-making.

Combine with domain knowledge ruthlessly. No algorithm can recover the true causal graph from purely observational data without assumptions. The PC algorithm assumes causal sufficiency (no unobserved confounders) and faithfulness. NOTEARS assumes a linear structural equation model in its basic form. These are strong assumptions. Domain expertise — encoding which edges are impossible, which must exist, which directions are known — dramatically improves results. Both gCastle (via prior_knowledge parameters) and CausalNex (via manual graph editing) support this.

The Bigger Picture

Sam Bourton, a QuantumBlack partner, stated during CausalNex's launch that open sourcing makes strategic sense because it helps clients avoid vendor lock-in. That pragmatism extends to the tools themselves: neither gCastle nor CausalNex claims to be the final word on causal discovery. They are tools with specific strengths, built on top of algorithms with specific assumptions, operating within a field that is still rapidly evolving.

The real value of these libraries is not that they give you a causal graph at the push of a button. It is that they lower the barrier to asking causal questions in the first place. Before NOTEARS, before gCastle, before CausalNex, even formulating the structure learning problem required deep expertise in graphical models. Now, a Python developer with a solid grasp of their data can generate plausible candidate structures, evaluate them against ground truth in simulated settings, and use them to reason about interventions — all within a single afternoon.

That is a meaningful shift. And it is happening in Python, where the scientific computing ecosystem, shaped by PEPs like 450, 465, and 484, provides the foundation these tools stand on.

The code runs. The algorithms converge. The rest — choosing the right assumptions, injecting the right domain knowledge, and interpreting the results honestly — is still up to you.


References and Further Reading

  • Zhang, K., Zhu, S., Kalander, M., Ng, I., Ye, J., Chen, Z., & Pan, L. (2021). gCastle: A Python Toolbox for Causal Discovery. arXiv:2111.15155.
  • Beaumont, P., Horsburgh, B., et al. (2020). CausalNex. QuantumBlack / McKinsey. GitHub: github.com/mckinsey/causalnex.
  • Zheng, X., Aragam, B., Ravikumar, P., & Xing, E.P. (2018). DAGs with NO TEARS: Continuous Optimization for Structure Learning. NeurIPS 2018.
  • Pearl, J. & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
  • Schölkopf, B. (2019). Causality for Machine Learning. arXiv:1911.10500.
  • Pamfil, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., & Aragam, B. (2020). DYNOTEARS: Structure Learning from Time-Series Data. AISTATS 2020.
  • Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press, 2nd edition.
  • Reisach, A., Seiler, C., & Weichwald, S. (2021). Beware of the Simulated DAG! Varsortability and Its Implications. NeurIPS 2021.
  • PEP 450: Adding A Statistics Module To The Standard Library. peps.python.org/pep-0450.
  • PEP 484: Type Hints. peps.python.org/pep-0484.
  • PEP 465: A dedicated infix operator for matrix multiplication. peps.python.org/pep-0465.