
Causal Inference Part 6: End-to-End Practice — The Causal Pipeline in Code


Causal Inference — From First Principles to Automated Reasoning

This is Part 6 of an 8-part series.


Hook: The Bridge Between Theory and Practice

Theory without practice is philosophy. Practice without theory is gambling. This post is the bridge.

Parts 1-5 built a complete mathematical framework: potential outcomes, structural causal models, identification strategies, estimation methods, heterogeneous effects, and causal discovery. Every concept had a formal definition, an intuition, and an assumption list. But none of it ran on a computer.

This post translates every major concept from Parts 1-5 into working Python code that you can copy, run, and adapt to your own data. We implement the full DoWhy pipeline (Model, Identify, Estimate, Refute), build a difference-in-differences analysis with event study plots, run matching and doubly robust estimation with balance diagnostics, fit causal forests for heterogeneous effects, apply causal discovery algorithms, and stress-test every result with refutation.

The code is deliberately self-contained. Each section generates its own synthetic data with a known ground truth, so you can verify that the estimator recovers the true effect before applying it to real data.


1. The DoWhy Pipeline — Model, Identify, Estimate, Refute

The DoWhy library (Sharma & Kiciman, 2020) implements the four-step causal pipeline introduced in Part 0 and used throughout the series. Every causal analysis — regardless of estimator — should pass through these four gates before any conclusion is drawn.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    MODEL     │───>│   IDENTIFY   │───>│   ESTIMATE   │───>│    REFUTE    │
│              │    │              │    │              │    │              │
│ Declare the  │    │ Derive the   │    │ Compute the  │    │ Stress-test  │
│ causal graph │    │ estimand     │    │ numerical    │    │ the result   │
│ (DAG)        │    │ from graph   │    │ estimate     │    │              │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
     Part 2              Part 3              Part 4             Parts 4-6

1.1 Step 1 — Model: Define the Causal Graph

We start with synthetic data where the ground truth is known. The data-generating process is:

  • $\mathbf{X}_i \sim \mathcal{N}(0, 1)$ — a single confounder
  • $W_i \sim \text{Bernoulli}(\sigma(\mathbf{X}_i))$ — treatment depends on confounder via logistic link
  • $Y_i = 2\mathbf{X}_i + 5W_i + \varepsilon_i$ — outcome with true $\tau = 5.0$
import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np

# ── Generate synthetic data with known ground truth ──
np.random.seed(42)
N = 2000
X = np.random.randn(N)
W = (np.random.rand(N) < 1 / (1 + np.exp(-X))).astype(int)  # e(x) = sigmoid(x)
Y = 2 * X + 5 * W + np.random.randn(N)                       # true tau = 5.0

data = pd.DataFrame({"X": X, "W": W, "Y": Y})

# ── Step 1: Model — declare the causal graph ──
model = CausalModel(
    data=data,
    treatment="W",
    outcome="Y",
    common_causes=["X"]      # X confounds W -> Y
)
# model.view_model()  # renders the DAG (requires pygraphviz)

The graph encodes our assumptions, not our data. DoWhy uses this graph to determine which adjustment strategy is valid.

1.2 Step 2 — Identify: Derive the Estimand

DoWhy inspects the graph and returns a symbolic estimand — the mathematical expression that, under the stated assumptions, equals the causal effect.

# ── Step 2: Identify — find the estimand ──
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
# Output (abbreviated):
# Estimand type: EstimandType.NONPARAMETRIC_ATE
# Estimand expression:
#   d/dW[ E[Y | X] ]
# Backdoor variables: ['X']

The backdoor criterion from Part 2 tells us: conditioning on $\mathbf{X}$ blocks all confounding paths between $W_i$ and $Y_i$. The estimand is the backdoor adjustment formula:

\[\tau = E_{\mathbf{X}}[E[Y_i \mid W_i = 1, \mathbf{X}_i] - E[Y_i \mid W_i = 0, \mathbf{X}_i]]\]
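Before trusting the library, it is worth computing the backdoor formula by hand. A minimal sketch on the same data-generating process, fitting one outcome regression per treatment arm and averaging the predicted difference over $\mathbf{X}$ (variable names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same DGP as above: X confounds W -> Y, true tau = 5.0
rng = np.random.default_rng(42)
N = 2000
X = rng.standard_normal(N)
W = (rng.random(N) < 1 / (1 + np.exp(-X))).astype(int)
Y = 2 * X + 5 * W + rng.standard_normal(N)

# Backdoor adjustment: fit E[Y | W = w, X] separately per treatment arm...
Xc = X.reshape(-1, 1)
mu1 = LinearRegression().fit(Xc[W == 1], Y[W == 1])   # E[Y | W = 1, X]
mu0 = LinearRegression().fit(Xc[W == 0], Y[W == 0])   # E[Y | W = 0, X]

# ...then average the predicted difference over the distribution of X
tau_hat = np.mean(mu1.predict(Xc) - mu0.predict(Xc))
print(f"Manual backdoor estimate: {tau_hat:.3f}")
```

The result should land within sampling noise of the true $\tau = 5.0$, matching what DoWhy computes symbolically.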

1.3 Step 3 — Estimate: Compute the Numerical Effect

With the estimand in hand, we choose a statistical method to compute $\hat{\tau}$.

# ── Step 3: Estimate — propensity score stratification ──
estimate_ps = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_stratification"
)
print(f"Propensity score stratification: {estimate_ps.value:.3f}")

# ── Alternative: linear regression ──
estimate_lr = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression"
)
print(f"Linear regression: {estimate_lr.value:.3f}")

Expected output (approximate):

Propensity score stratification: 4.95
Linear regression: 5.01

Both estimators recover the true $\tau = 5.0$ within sampling noise. The propensity score method uses the balancing property from Part 4; linear regression uses the backdoor adjustment directly.

1.4 Step 4 — Refute: Stress-Test the Result

A result without refutation is a claim without evidence. DoWhy provides four built-in refutation tests:

# ── Step 4: Refute — four tests ──

# (a) Random common cause: add a noise variable as a confounder
refute_random = model.refute_estimate(
    identified_estimand, estimate_lr,
    method_name="random_common_cause"
)
print(refute_random)

# (b) Placebo treatment: replace W with random noise
refute_placebo = model.refute_estimate(
    identified_estimand, estimate_lr,
    method_name="placebo_treatment_refuter",
    placebo_type="permute"
)
print(refute_placebo)

# (c) Data subset: re-estimate on a random 80% of the data
refute_subset = model.refute_estimate(
    identified_estimand, estimate_lr,
    method_name="data_subset_refuter",
    subset_fraction=0.8
)
print(refute_subset)

# (d) Bootstrap: check stability across resamples
refute_bootstrap = model.refute_estimate(
    identified_estimand, estimate_lr,
    method_name="bootstrap_refuter",
    num_simulations=100
)
print(refute_bootstrap)

Refutation Test       What It Checks                                     Pass Criterion
-------------------   ------------------------------------------------   ----------------------------------
Random common cause   Robustness to an unobserved confounder             Estimate unchanged (p > 0.05)
Placebo treatment     Whether effect is specific to the real treatment   Effect near zero
Data subset           Stability across subsamples                        Estimate within CI of original
Bootstrap             Finite-sample variability                          Narrow CI around original estimate

If any refutation test fails, the causal claim is suspect. The placebo test is especially powerful: if replacing the real treatment with random noise still yields a “significant” effect, the original estimate is likely driven by specification error rather than causation.
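The placebo logic is easy to hand-roll without DoWhy: permute the treatment column, re-estimate, and confirm that the "effect" collapses toward zero. A self-contained sketch on the same data-generating process:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same DGP as Section 1: X confounds W -> Y, true tau = 5.0
rng = np.random.default_rng(0)
N = 2000
X = rng.standard_normal(N)
W = (rng.random(N) < 1 / (1 + np.exp(-X))).astype(int)
Y = 2 * X + 5 * W + rng.standard_normal(N)
df = pd.DataFrame({"X": X, "W": W, "Y": Y})

# Real estimate: backdoor adjustment via regression on W and X
tau_real = smf.ols("Y ~ W + X", data=df).fit().params["W"]

# Placebo: permute W, severing any causal link to Y
df["W_placebo"] = rng.permutation(df["W"].values)
tau_placebo = smf.ols("Y ~ W_placebo + X", data=df).fit().params["W_placebo"]

print(f"Real treatment effect:    {tau_real:.3f}")
print(f"Placebo treatment effect: {tau_placebo:.3f}")
```

A placebo estimate materially different from zero would signal specification error rather than a causal effect.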


2. Difference-in-Differences in Practice

Parts 3 and 4 developed the DiD estimator mathematically. Here we implement the full DiD workflow: generate panel data, estimate with two-way fixed effects, run event study diagnostics, and test pre-trends.

2.1 Generate Panel Data

import statsmodels.formula.api as smf

# ── Generate panel data: 100 units x 20 periods ──
np.random.seed(123)
n_units = 100
n_periods = 20
treat_period = 10       # treatment starts at t = 10
n_treated = 50          # first 50 units are treated
delta_true = 3.0        # true DiD effect

rows = []
for i in range(n_units):
    alpha_i = np.random.randn() * 2        # unit fixed effect
    for t in range(n_periods):
        gamma_t = 0.5 * t                  # time trend (shared)
        treated = int(i < n_treated)
        post = int(t >= treat_period)
        treated_post = treated * post
        # Parallel trends hold by construction:
        # Y_it = alpha_i + gamma_t + delta * D_it + noise
        Y = alpha_i + gamma_t + delta_true * treated_post + np.random.randn()
        rows.append({
            "unit_id": i, "time": t, "treated": treated,
            "post": post, "treated_post": treated_post, "Y": Y
        })

panel = pd.DataFrame(rows)
print(f"Panel shape: {panel.shape}")
print(f"Treated units: {panel[panel['treated']==1]['unit_id'].nunique()}")
print(f"Periods: {panel['time'].nunique()}, Treatment at t={treat_period}")

The causal DAG for this design:

  Treatment group ────> treated_post
                             │  delta
                             ▼
  Unit FE (alpha_i) ──────> Y_it <────── Time FE (gamma_t)
2.2 Two-Way Fixed Effects OLS

The canonical DiD regression (Part 3, Equation 3.2):

\[Y_{it} = \alpha_i + \gamma_t + \delta \cdot (W_i \times \text{Post}_t) + \varepsilon_{it}\]

where $\alpha_i$ is the unit fixed effect, $\gamma_t$ is the time fixed effect, and $\delta$ is the DiD coefficient.

# ── TWFE DiD with clustered standard errors ──
formula = "Y ~ treated_post + C(unit_id) + C(time)"
did_model = smf.ols(formula, data=panel).fit(
    cov_type='cluster',
    cov_kwds={'groups': panel['unit_id']}
)

delta_hat = did_model.params['treated_post']
se_hat = did_model.bse['treated_post']
ci_low = delta_hat - 1.96 * se_hat
ci_high = delta_hat + 1.96 * se_hat

print(f"DiD estimate (delta): {delta_hat:.3f}")
print(f"Standard error (clustered): {se_hat:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"True delta: {delta_true}")

Pitfall: Standard errors must be clustered at the unit level (Bertrand, Duflo, and Mullainathan, 2004). Unclustered standard errors dramatically understate uncertainty because outcomes within a unit are serially correlated.
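The understatement is straightforward to demonstrate. The sketch below simulates a panel with AR(1) errors within each unit (rho = 0.8, chosen for illustration) and compares naive and cluster-robust standard errors for the same point estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel with serially correlated errors within units
rng = np.random.default_rng(7)
n_units, n_periods = 100, 20
rows = []
for i in range(n_units):
    treated = int(i < 50)
    e = rng.standard_normal()
    for t in range(n_periods):
        e = 0.8 * e + rng.standard_normal()   # AR(1) error, rho = 0.8
        D = treated * int(t >= 10)
        rows.append({"unit": i, "time": t, "D": D,
                     "Y": 0.5 * t + 3.0 * D + e})
df = pd.DataFrame(rows)

formula = "Y ~ D + C(unit) + C(time)"
naive = smf.ols(formula, data=df).fit()                 # iid standard errors
clustered = smf.ols(formula, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]}
)

print(f"Point estimate:    {naive.params['D']:.3f}")
print(f"Naive SE:          {naive.bse['D']:.3f}")
print(f"Cluster-robust SE: {clustered.bse['D']:.3f}")
```

With serial correlation this strong, the cluster-robust standard error is typically well above the naive one; inference based on the naive standard error would be overconfident.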

2.3 PPML for Count Outcomes

When $Y_{it}$ is a count (e.g., number of publications, patent counts), OLS on levels or log-transformed outcomes is problematic. PPML (Santos Silva and Tenreyro, 2006) handles zeros, is consistent under heteroskedasticity, and directly estimates multiplicative effects.

import statsmodels.api as sm

# ── Generate count outcome ──
np.random.seed(456)
panel["Y_count"] = np.random.poisson(
    lam=np.exp(0.5 + 0.3 * panel["treated_post"] + 0.02 * panel["time"])
)

# ── PPML (Poisson pseudo-maximum likelihood) ──
X_ppml = pd.get_dummies(panel[["treated_post", "time"]], columns=["time"], drop_first=True, dtype=float)
X_ppml = sm.add_constant(X_ppml)

ppml_model = sm.GLM(
    panel["Y_count"],
    X_ppml,
    family=sm.families.Poisson()
).fit()

print(f"PPML coefficient on treated_post: {ppml_model.params['treated_post']:.4f}")
print(f"Implied multiplicative effect: {np.exp(ppml_model.params['treated_post']):.4f}")

Method          Outcome Type           Handles Zeros   Consistent Under Heteroskedasticity
--------------  ---------------------  --------------  -----------------------------------
OLS on levels   Continuous             Yes             No (efficiency loss)
OLS on log(Y)   Continuous, Y > 0      No              No
PPML            Counts or continuous   Yes             Yes

2.4 Event Study: Probing Parallel Trends

The parallel trends assumption is untestable (Part 3), but we can probe it by estimating period-specific effects $\delta_k$ for each period relative to the treatment:

\[Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \delta_k \cdot W_i \cdot \mathbf{1}[t - t^* = k] + \varepsilon_{it}\]

where $t^*$ is the treatment period and $k = -1$ is the reference period.

# ── Event study: estimate delta_k for each relative period ──
panel["rel_time"] = panel["time"] - treat_period

# Create interaction dummies (exclude k = -1 as reference)
for k in range(-(treat_period), n_periods - treat_period):
    if k == -1:
        continue  # reference period
    col_name = f"rel_{k}" if k < 0 else f"rel_p{k}"
    panel[col_name] = (panel["treated"] * (panel["rel_time"] == k)).astype(int)

# Build formula: Y ~ sum of rel_k dummies + unit FE + time FE
rel_cols = [c for c in panel.columns
            if c.startswith("rel_") and c != "rel_time"]  # exclude the continuous rel_time column
formula_es = "Y ~ " + " + ".join(rel_cols) + " + C(unit_id) + C(time)"

es_model = smf.ols(formula_es, data=panel).fit(
    cov_type='cluster',
    cov_kwds={'groups': panel['unit_id']}
)

# ── Extract event study coefficients ──
es_results = []
for col in sorted(rel_cols):
    k_val = col.replace("rel_p", "").replace("rel_", "")
    k_int = int(k_val)
    coef = es_model.params.get(col, 0)
    se = es_model.bse.get(col, 0)
    es_results.append({"k": k_int, "coef": coef, "se": se})

es_df = pd.DataFrame(es_results).sort_values("k")
# Add the reference period (k = -1, coef = 0)
es_df = pd.concat([es_df, pd.DataFrame([{"k": -1, "coef": 0, "se": 0}])])
es_df = es_df.sort_values("k").reset_index(drop=True)

print("Event Study Coefficients:")
print(es_df.to_string(index=False))

If the pre-treatment coefficients ($\delta_k$ for $k < -1$) are jointly indistinguishable from zero, parallel trends is supported. A visual check:

  Event Study Plot (conceptual)

  delta_k
   3 |                                 *  *  *  *  *  *  *  *  *  *   <- post-treatment: ~3.0
   2 |
   1 |
   0 |--*--*--*--*--*--*--*--*--*--*--+------------------------------
     +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--
      -10 -9 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9
                                      ↑
                                treatment onset

  Pre-trends: flat near zero (k < 0)    Post-treatment: jumps to ~3 (= delta)
import scipy.stats as stats

# ── Joint F-test for pre-treatment coefficients ──
# Pre-treatment dummies are named "rel_-k"; post-treatment ones start with "rel_p"
pre_cols = [c for c in rel_cols if c.startswith("rel_-")]

pre_coefs = [es_model.params[c] for c in pre_cols if c in es_model.params]
pre_cov = es_model.cov_params().loc[pre_cols, pre_cols]

if len(pre_coefs) > 0:
    pre_coefs_arr = np.array(pre_coefs)
    pre_cov_arr = pre_cov.values
    # Wald test: H0: all pre-treatment coefficients = 0
    try:
        wald_stat = pre_coefs_arr @ np.linalg.inv(pre_cov_arr) @ pre_coefs_arr
        p_value = 1 - stats.chi2.cdf(wald_stat, df=len(pre_coefs))
        print(f"Pre-trends Wald test: chi2 = {wald_stat:.2f}, "
              f"df = {len(pre_coefs)}, p = {p_value:.4f}")
        print("Parallel trends supported" if p_value > 0.05
              else "WARNING: pre-trends detected")
    except np.linalg.LinAlgError:
        print("Covariance matrix singular — check collinearity")

3. Matching, IPW, and Doubly Robust in Practice

Part 4 developed matching, inverse probability weighting (IPW), and the doubly robust estimator (AIPW) theoretically. Here we implement all three, starting from propensity score estimation and ending with AIPW, which is consistent if either the propensity model or the outcome model is correctly specified.

3.1 Propensity Score Estimation

The propensity score $e(\mathbf{x}) = P(W_i = 1 \mid \mathbf{X}_i = \mathbf{x})$ is the central object linking all three methods. We estimate it with logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# ── Generate richer confounded data ──
np.random.seed(789)
N = 3000
X1 = np.random.randn(N)
X2 = np.random.randn(N)
X3 = np.random.randn(N)
X_mat = np.column_stack([X1, X2, X3])

# Treatment depends on X1 and X2
logit = 0.5 * X1 + 0.3 * X2
e_true = 1 / (1 + np.exp(-logit))
W = (np.random.rand(N) < e_true).astype(int)

# Outcome depends on X1, X3, and treatment (true tau = 4.0)
Y = 1.5 * X1 + 0.8 * X3 + 4.0 * W + np.random.randn(N)

obs_data = pd.DataFrame({
    "X1": X1, "X2": X2, "X3": X3, "W": W, "Y": Y
})

# ── Estimate propensity score ──
scaler = StandardScaler()
X_scaled = scaler.fit_transform(obs_data[["X1", "X2", "X3"]])

ps_model = LogisticRegression(max_iter=1000, random_state=42)
ps_model.fit(X_scaled, obs_data["W"])
obs_data["ps"] = ps_model.predict_proba(X_scaled)[:, 1]

print(f"Propensity score range: [{obs_data['ps'].min():.3f}, {obs_data['ps'].max():.3f}]")
print(f"Mean PS (treated):   {obs_data.loc[obs_data['W']==1, 'ps'].mean():.3f}")
print(f"Mean PS (control):   {obs_data.loc[obs_data['W']==0, 'ps'].mean():.3f}")

3.2 Balance Diagnostics — Standardized Mean Differences

Before estimating anything, check covariate balance. The standardized mean difference (SMD) quantifies imbalance: values above 0.1 are traditionally considered problematic.

\[\text{SMD}_j = \frac{\bar{X}_{j,\text{treated}} - \bar{X}_{j,\text{control}}}{\sqrt{(s^2_{j,\text{treated}} + s^2_{j,\text{control}}) / 2}}\]
def compute_smd(data, covariates, treatment_col, weight_col=None):
    """Compute standardized mean differences for each covariate.

    Args:
        data: DataFrame with covariates and treatment
        covariates: list of covariate column names
        treatment_col: name of binary treatment column
        weight_col: optional IPW weight column

    Returns:
        DataFrame with SMD for each covariate
    """
    treated = data[data[treatment_col] == 1]
    control = data[data[treatment_col] == 0]
    results = []

    for cov in covariates:
        if weight_col is not None:
            mean_t = np.average(treated[cov], weights=treated[weight_col])
            mean_c = np.average(control[cov], weights=control[weight_col])
        else:
            mean_t = treated[cov].mean()
            mean_c = control[cov].mean()
        var_t = treated[cov].var()
        var_c = control[cov].var()
        smd = (mean_t - mean_c) / np.sqrt((var_t + var_c) / 2)
        results.append({"covariate": cov, "mean_treated": mean_t,
                        "mean_control": mean_c, "SMD": smd})
    return pd.DataFrame(results)

# ── Unadjusted balance ──
covariates = ["X1", "X2", "X3"]
balance_raw = compute_smd(obs_data, covariates, "W")
print("=== Unadjusted Balance ===")
print(balance_raw.to_string(index=False))

Expected output (approximate):

=== Unadjusted Balance ===
covariate  mean_treated  mean_control    SMD
       X1         0.33         -0.33   0.66   <-- confounded
       X2         0.19         -0.18   0.37   <-- confounded
       X3         0.01         -0.01   0.02   <-- balanced (not a confounder of W)

X1 and X2 are imbalanced (SMD > 0.1) because they drive treatment assignment. X3 is balanced because it does not affect treatment.

  Conceptual Balance Plot (Love Plot)

  Covariate     |---- Unadjusted ----|---- IPW-Adjusted ----|
                 <---0.1---0---0.1--->
       X1             *                          o
       X2             *                          o
       X3                    *o

  * = unadjusted SMD     o = IPW-adjusted SMD
  Goal: all |SMD| < 0.1 after adjustment

3.3 IPW Estimation

The IPW estimator (Horvitz-Thompson, Part 4) re-weights each observation by the inverse of its propensity score:

\[\hat{\tau}_{\text{IPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{W_i Y_i}{\hat{e}(\mathbf{x}_i)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(\mathbf{x}_i)} \right]\]
# ── IPW estimator ──
def ipw_ate(data, outcome_col, treatment_col, ps_col, trim=0.01):
    """Compute ATE via inverse probability weighting.

    Args:
        data: DataFrame
        outcome_col: outcome column name
        treatment_col: treatment column name
        ps_col: propensity score column name
        trim: trim extreme propensity scores for stability

    Returns:
        float: estimated ATE
    """
    df = data.copy()
    # Trim extreme propensity scores to avoid division by near-zero
    df[ps_col] = df[ps_col].clip(lower=trim, upper=1 - trim)

    W = df[treatment_col].values
    Y = df[outcome_col].values
    ps = df[ps_col].values

    # Horvitz-Thompson estimator
    tau_hat = np.mean(W * Y / ps - (1 - W) * Y / (1 - ps))
    return tau_hat

tau_ipw = ipw_ate(obs_data, "Y", "W", "ps")
print(f"IPW estimate: {tau_ipw:.3f}  (true tau = 4.0)")

# ── Check balance after IPW ──
obs_data["ipw_weight"] = np.where(
    obs_data["W"] == 1,
    1 / obs_data["ps"].clip(0.01, 0.99),
    1 / (1 - obs_data["ps"].clip(0.01, 0.99))
)
balance_ipw = compute_smd(obs_data, covariates, "W", weight_col="ipw_weight")
print("\n=== IPW-Adjusted Balance ===")
print(balance_ipw.to_string(index=False))

3.4 AIPW — The Doubly Robust Estimator

The AIPW estimator (Robins, Rotnitzky, and Zhao, 1994) combines an outcome model $\hat{\mu}_w(\mathbf{x})$ with the propensity score $\hat{e}(\mathbf{x})$:

\[\hat{\tau}_{\text{AIPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(\mathbf{x}_i) - \hat{\mu}_0(\mathbf{x}_i) + \frac{W_i(Y_i - \hat{\mu}_1(\mathbf{x}_i))}{\hat{e}(\mathbf{x}_i)} - \frac{(1-W_i)(Y_i - \hat{\mu}_0(\mathbf{x}_i))}{1 - \hat{e}(\mathbf{x}_i)} \right]\]

AIPW is consistent if either the propensity model or the outcome model is correctly specified — hence “doubly robust.”

from sklearn.linear_model import LinearRegression

def aipw_ate(data, outcome_col, treatment_col, covariates, ps_col, trim=0.01):
    """Compute ATE via augmented inverse probability weighting (AIPW).

    Args:
        data: DataFrame
        outcome_col, treatment_col: column names
        covariates: list of covariate column names
        ps_col: propensity score column name
        trim: propensity score trimming threshold

    Returns:
        float: estimated ATE
    """
    df = data.copy()
    df[ps_col] = df[ps_col].clip(lower=trim, upper=1 - trim)

    W = df[treatment_col].values
    Y = df[outcome_col].values
    X = df[covariates].values
    ps = df[ps_col].values

    # Fit outcome models mu_1(x) and mu_0(x)
    treated_idx = W == 1
    control_idx = W == 0

    model_1 = LinearRegression().fit(X[treated_idx], Y[treated_idx])
    model_0 = LinearRegression().fit(X[control_idx], Y[control_idx])

    mu1_hat = model_1.predict(X)
    mu0_hat = model_0.predict(X)

    # AIPW formula
    tau_hat = np.mean(
        (mu1_hat - mu0_hat)
        + W * (Y - mu1_hat) / ps
        - (1 - W) * (Y - mu0_hat) / (1 - ps)
    )
    return tau_hat

tau_aipw = aipw_ate(obs_data, "Y", "W", covariates, "ps")
print(f"AIPW estimate: {tau_aipw:.3f}  (true tau = 4.0)")

Estimator              What It Requires                                Double Robustness   Variance
---------------------  ----------------------------------------------  ------------------  -----------------------------
Regression adjustment  Correct outcome model $\hat{\mu}_w$             No                  Low if model correct
IPW                    Correct propensity model $\hat{e}$              No                  Can be high (extreme weights)
AIPW                   Both models, but only one needs to be correct   Yes                 Semiparametric efficient
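Double robustness can be checked numerically. The sketch below deliberately misspecifies the outcome model (a plain difference in arm means that ignores the confounder) while keeping the propensity model correct; constants and seed are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# DGP: X1 confounds both treatment and outcome; true tau = 4.0
rng = np.random.default_rng(99)
N = 5000
X1 = rng.standard_normal(N)
W = (rng.random(N) < 1 / (1 + np.exp(-X1))).astype(int)
Y = 2.0 * X1 + 4.0 * W + rng.standard_normal(N)

# Correct propensity model (uses X1)
ps = LogisticRegression().fit(X1.reshape(-1, 1), W).predict_proba(
    X1.reshape(-1, 1))[:, 1]
ps = np.clip(ps, 0.01, 0.99)

# Deliberately WRONG outcome model: arm means that ignore X1 entirely
mu1_bad = np.full(N, Y[W == 1].mean())
mu0_bad = np.full(N, Y[W == 0].mean())

# Pure outcome-model estimate with the bad model: confounded
tau_outcome_only = np.mean(mu1_bad - mu0_bad)

# AIPW with the same bad outcome model but a correct propensity model
tau_aipw_dr = np.mean(
    (mu1_bad - mu0_bad)
    + W * (Y - mu1_bad) / ps
    - (1 - W) * (Y - mu0_bad) / (1 - ps)
)
print(f"Outcome model only (misspecified): {tau_outcome_only:.2f}")
print(f"AIPW (bad outcome model, good PS): {tau_aipw_dr:.2f}")
```

The misspecified outcome model alone is biased upward by confounding, while the AIPW correction term pulls the estimate back near the true effect.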

4. DML and Causal Forests in Practice

Double/Debiased Machine Learning (Chernozhukov et al., 2018) allows any ML model for nuisance estimation while maintaining valid inference for the causal parameter. Causal forests (Wager and Athey, 2018) extend this to heterogeneous effects.

4.1 CausalForestDML — End-to-End

EconML’s CausalForestDML combines DML’s cross-fitting with a causal forest for CATE estimation:

from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# ── Generate data with heterogeneous effects ──
np.random.seed(2026)
N = 5000
X1 = np.random.randn(N)
X2 = np.random.randn(N)
X3 = np.random.uniform(0, 1, N)
X_features = np.column_stack([X1, X2, X3])

# Treatment: depends on X1
T = (np.random.rand(N) < 1 / (1 + np.exp(-0.5 * X1))).astype(int)

# True CATE: tau(x) = 3 + 2*X1 (heterogeneous in X1, not X2 or X3)
tau_true = 3 + 2 * X1
Y = 1.0 * X1 + 0.5 * X2 + tau_true * T + np.random.randn(N)

print(f"True ATE: {tau_true.mean():.3f}")
print(f"True CATE range: [{tau_true.min():.2f}, {tau_true.max():.2f}]")
# ── Fit CausalForestDML ──
est = CausalForestDML(
    model_y=GradientBoostingRegressor(
        n_estimators=100, max_depth=4, random_state=42
    ),
    model_t=GradientBoostingClassifier(
        n_estimators=100, max_depth=4, random_state=42
    ),
    n_estimators=200,
    min_samples_leaf=10,
    random_state=42
)
est.fit(Y, T, X=X_features)

# ── ATE with confidence interval ──
ate_point = est.ate(X=X_features)
ate_lb, ate_ub = est.ate_interval(X=X_features, alpha=0.05)
print(f"Estimated ATE: {ate_point:.3f}")
print(f"ATE 95% CI: [{ate_lb:.3f}, {ate_ub:.3f}]")

# ── CATE for individual units ──
cate_hat = est.effect(X_features)
print(f"\nCATE summary:")
print(f"  Mean:   {cate_hat.mean():.3f}")
print(f"  Std:    {cate_hat.std():.3f}")
print(f"  Range:  [{cate_hat.min():.2f}, {cate_hat.max():.2f}]")

4.2 CATE Inspection by Subgroup

The practical payoff of heterogeneous effect estimation: identify who benefits most from treatment.

# ── CATE by subgroup: split on X1 (the true effect modifier) ──
X1_quantiles = pd.qcut(X1, q=5, labels=["Q1 (low)", "Q2", "Q3", "Q4", "Q5 (high)"])
cate_by_group = pd.DataFrame({
    "X1_quantile": X1_quantiles,
    "CATE_hat": cate_hat,
    "CATE_true": tau_true
})

summary = cate_by_group.groupby("X1_quantile").agg(
    CATE_estimated=("CATE_hat", "mean"),
    CATE_true=("CATE_true", "mean"),
    n=("CATE_hat", "count")
).round(3)

print("\n=== CATE by X1 Quantile ===")
print(summary.to_string())

Expected output (approximate):

=== CATE by X1 Quantile ===
             CATE_estimated  CATE_true     n
X1_quantile
Q1 (low)              0.35       0.36  1000
Q2                    1.78       1.78  1000
Q3                    2.97       3.00  1000
Q4                    4.18       4.22  1000
Q5 (high)             5.55       5.64  1000

The causal forest recovers the monotonic relationship between X1 and the treatment effect. Units in Q5 benefit roughly 15 times as much as units in Q1.

4.3 Feature Importance for Heterogeneity

# ── Which features drive heterogeneity? ──
feature_names = ["X1", "X2", "X3"]
importances = est.feature_importances_

print("\n=== Feature Importance for Treatment Effect Heterogeneity ===")
for name, imp in sorted(zip(feature_names, importances),
                        key=lambda x: -x[1]):
    bar = "#" * int(imp * 50)
    print(f"  {name:4s}: {imp:.3f}  {bar}")

Expected output:

=== Feature Importance for Treatment Effect Heterogeneity ===
  X1  : 0.82  #########################################
  X2  : 0.10  #####
  X3  : 0.08  ####

X1 dominates because it is the only true effect modifier. X2 and X3 receive small importance from noise-fitting.

4.4 Meta-Learners via CausalML

For comparison, we can estimate CATE with meta-learners (Künzel et al., 2019). The S-learner, T-learner, and X-learner offer different bias-variance tradeoffs.

from causalml.inference.meta import (
    BaseSRegressor,
    BaseTRegressor,
    BaseXRegressor
)
from sklearn.ensemble import GradientBoostingRegressor as GBR

# ── S-Learner: single model with treatment as a feature ──
s_learner = BaseSRegressor(learner=GBR(n_estimators=100, random_state=42))
cate_s = s_learner.fit_predict(X=X_features, treatment=T, y=Y)

# ── T-Learner: separate models for treated and control ──
t_learner = BaseTRegressor(learner=GBR(n_estimators=100, random_state=42))
cate_t = t_learner.fit_predict(X=X_features, treatment=T, y=Y)

# ── X-Learner: impute counterfactuals, then regress ──
x_learner = BaseXRegressor(learner=GBR(n_estimators=100, random_state=42))
cate_x = x_learner.fit_predict(X=X_features, treatment=T, y=Y)

print("=== Meta-Learner ATE Comparison ===")
print(f"  S-Learner ATE: {cate_s.mean():.3f}")
print(f"  T-Learner ATE: {cate_t.mean():.3f}")
print(f"  X-Learner ATE: {cate_x.mean():.3f}")
print(f"  CausalForest ATE: {cate_hat.mean():.3f}")
print(f"  True ATE:       {tau_true.mean():.3f}")

Meta-Learner    Strategy                               Best When
--------------  -------------------------------------  --------------------------------------------------
S-Learner       Single model, treatment as feature     Large N, small heterogeneity
T-Learner       Separate models per treatment arm      Balanced groups, distinct response surfaces
X-Learner       Cross-impute counterfactuals           Unequal group sizes
Causal Forest   Honest splitting on treatment effect   Moderate to large N, tree-compatible heterogeneity
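The T-learner's logic is simple enough to hand-roll with scikit-learn alone, which makes the mechanics transparent; a sketch on data with a known heterogeneous effect (hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# DGP with a known heterogeneous effect: tau(x) = 3 + 2*X1
rng = np.random.default_rng(2026)
N = 5000
X1 = rng.standard_normal(N)
X2 = rng.standard_normal(N)
Xf = np.column_stack([X1, X2])

T = (rng.random(N) < 1 / (1 + np.exp(-0.5 * X1))).astype(int)
tau_true = 3 + 2 * X1
Y = 1.0 * X1 + 0.5 * X2 + tau_true * T + rng.standard_normal(N)

# T-learner: one outcome model per arm, CATE = difference of predictions
m1 = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
m0 = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
m1.fit(Xf[T == 1], Y[T == 1])
m0.fit(Xf[T == 0], Y[T == 0])

cate_hand = m1.predict(Xf) - m0.predict(Xf)
print(f"Estimated ATE:   {cate_hand.mean():.2f}  (true {tau_true.mean():.2f})")
print(f"Corr(true, est): {np.corrcoef(tau_true, cate_hand)[0, 1]:.2f}")
```

Because each arm's regression conditions on the confounder X1, the difference of predictions is a backdoor-adjusted CATE estimate.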

5. Causal Discovery in Practice

Parts 2 and 5 introduced causal discovery — learning the DAG from data. Here we apply the PC algorithm and compare the learned graph to a known ground truth.

5.1 The PC Algorithm

The PC algorithm (Spirtes, Glymour, and Scheines, 2000) starts with a complete undirected graph and removes edges via conditional independence tests.
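The primitive operation underneath PC is the conditional independence test; for Gaussian data this is a Fisher-z test on the partial correlation. A minimal sketch of that primitive (not causallearn's implementation; `fisher_z_test` is a helper written here):

```python
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond_set):
    """Test X_i independent of X_j given X_{cond_set} via partial correlation."""
    n = data.shape[0]
    idx = [i, j] + list(cond_set)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)                           # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond_set) - 3)
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    return r, p

# Chain X1 -> X2 -> X3: X1 and X3 are marginally dependent,
# but independent once we condition on X2.
rng = np.random.default_rng(1)
n = 2000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.5 * rng.standard_normal(n)
x3 = 0.6 * x2 + 0.5 * rng.standard_normal(n)
D = np.column_stack([x1, x2, x3])

r_marg, p_marg = fisher_z_test(D, 0, 2, [])
r_cond, p_cond = fisher_z_test(D, 0, 2, [1])
print(f"X1 vs X3 marginal: r = {r_marg:.3f}, p = {p_marg:.4f}")   # dependent
print(f"X1 vs X3 given X2: r = {r_cond:.3f}, p = {p_cond:.4f}")   # independent
```

PC runs tests like this for every pair and conditioning set, deleting the X1 to X3 edge here because conditioning on X2 renders them independent.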

from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz

# ── Generate data from a known DAG: X1 -> X2 -> X3, X1 -> X3 ──
np.random.seed(42)
N_disc = 2000
X1_disc = np.random.randn(N_disc)
X2_disc = 0.8 * X1_disc + np.random.randn(N_disc) * 0.5
X3_disc = 0.5 * X1_disc + 0.6 * X2_disc + np.random.randn(N_disc) * 0.5

data_disc = np.column_stack([X1_disc, X2_disc, X3_disc])
labels = ["X1", "X2", "X3"]

# ── Run PC algorithm ──
cg = pc(data_disc, alpha=0.05, indep_test=fisherz, node_names=labels)

# ── Print adjacency matrix ──
adj = cg.G.graph
print("Adjacency matrix (PC algorithm output):")
print(f"Labels: {labels}")
print(adj)

The output adjacency matrix uses causallearn's endpoint convention:

  • adj[i, j] = -1 and adj[j, i] = 1: directed edge $i \to j$
  • adj[i, j] = adj[j, i] = -1: undirected edge (orientation not identified within the Markov equivalence class)
  True DAG vs. PC Output (conceptual)

  True DAG:                 PC Output (CPDAG):
  X1 ──> X2                 X1 ──> X2
  │       │                 │       │
  │       ▼                 │       ▼
  └─────> X3                └─────> X3

  In this case, PC recovers the correct structure
  because the v-structure at X3 orients both edges.

5.2 Comparing Discovery Algorithms

from causallearn.search.ScoreBased.GES import ges

# ── GES (Greedy Equivalence Search) ──
record = ges(data_disc, score_func='local_score_BIC')
adj_ges = record['G'].graph

print("\nGES adjacency matrix:")
print(adj_ges)

# ── Compare: count edge agreements ──
def compare_adjacency(adj1, adj2, labels):
    """Compare two adjacency matrices and report matches."""
    n = len(labels)
    agree = 0
    disagree = 0
    for i in range(n):
        for j in range(i + 1, n):
            edge1 = (adj1[i, j] != 0) or (adj1[j, i] != 0)
            edge2 = (adj2[i, j] != 0) or (adj2[j, i] != 0)
            if edge1 == edge2:
                agree += 1
            else:
                disagree += 1
                print(f"  Disagreement: {labels[i]} -- {labels[j]}: "
                      f"PC={edge1}, GES={edge2}")
    print(f"Edge agreement: {agree}/{agree + disagree}")

compare_adjacency(adj, adj_ges, labels)
AlgorithmTypeHandles Latent ConfoundersOutput
PCConstraint-basedNo (assumes causal sufficiency)CPDAG
FCIConstraint-basedYesPAG
GESScore-based (BIC)NoCPDAG

Pitfall: Causal discovery algorithms return Markov equivalence classes (CPDAGs), not unique DAGs. Multiple DAGs can encode the same conditional independence relations. The orientation of some edges remains ambiguous without additional assumptions (e.g., non-Gaussianity for LiNGAM, temporal ordering).
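The two-variable case makes the point concrete: X → Y and Y → X imply exactly the same Gaussian likelihood, so no likelihood-based score can distinguish them. A sketch comparing the maximized log-likelihoods of the two factorizations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 5000
x = rng.standard_normal(n)
y = 0.7 * x + 0.5 * rng.standard_normal(n)   # data truly generated as X -> Y

def gaussian_loglik(a, b):
    """Maximized log-likelihood of the factorization p(a) * p(b | a)."""
    # Marginal of a (MLE mean and std, ddof = 0)
    ll = stats.norm.logpdf(a, a.mean(), a.std()).sum()
    # Linear-Gaussian conditional b | a via OLS
    beta, alpha = np.polyfit(a, b, 1)
    resid = b - (alpha + beta * a)
    ll += stats.norm.logpdf(resid, 0, resid.std()).sum()
    return ll

ll_xy = gaussian_loglik(x, y)   # factorization p(x) p(y|x), i.e. X -> Y
ll_yx = gaussian_loglik(y, x)   # factorization p(y) p(x|y), i.e. Y -> X
print(f"log-lik X -> Y: {ll_xy:.2f}")
print(f"log-lik Y -> X: {ll_yx:.2f}")
print(f"difference:     {ll_xy - ll_yx:.6f}")
```

The difference is zero up to floating-point error: both factorizations reach the same joint-Gaussian maximum likelihood, which is exactly why extra assumptions (non-Gaussianity, temporal order) are needed to orient such edges.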


6. Refutation and Sensitivity Analysis

No estimate is credible without stress-testing. This section implements a comprehensive refutation workflow and introduces the E-value for sensitivity to unmeasured confounding.
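The E-value itself has a closed form (VanderWeele and Ding, 2017): for a risk ratio $\text{RR} \ge 1$, $E = \text{RR} + \sqrt{\text{RR}(\text{RR} - 1)}$. This is the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain the estimate away. A minimal sketch:

```python
import numpy as np

def e_value(rr):
    """E-value for a risk ratio estimate (VanderWeele & Ding, 2017)."""
    rr = max(rr, 1 / rr)            # use RR >= 1 (invert protective effects)
    return rr + np.sqrt(rr * (rr - 1))

# Larger observed risk ratios require stronger confounding to explain away
for rr in [1.2, 1.5, 2.0, 3.0]:
    print(f"RR = {rr:.1f}  ->  E-value = {e_value(rr):.2f}")
```

For example, an observed RR of 2.0 yields an E-value of about 3.41: an unmeasured confounder would need risk-ratio associations of at least 3.41 with both treatment and outcome to account for the effect.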

6.1 DoWhy Refutation — Full Walkthrough

We return to the DoWhy model from Section 1 and run all four refuters with interpretation.

import numpy as np
import pandas as pd
from dowhy import CausalModel

# ── Rebuild the DoWhy model (from Section 1) ──
np.random.seed(42)
N = 2000
X = np.random.randn(N)
W = (np.random.rand(N) < 1 / (1 + np.exp(-X))).astype(int)
Y = 2 * X + 5 * W + np.random.randn(N)
data = pd.DataFrame({"X": X, "W": W, "Y": Y})

model = CausalModel(data=data, treatment="W", outcome="Y",
                     common_causes=["X"])
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression"
)
print(f"Original estimate: {estimate.value:.3f}")

# ── Refutation 1: Random common cause ──
print("\n--- Refutation 1: Random Common Cause ---")
ref1 = model.refute_estimate(
    identified_estimand, estimate,
    method_name="random_common_cause"
)
print(ref1)

# ── Refutation 2: Placebo treatment ──
print("\n--- Refutation 2: Placebo Treatment ---")
ref2 = model.refute_estimate(
    identified_estimand, estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute"
)
print(ref2)

# ── Refutation 3: Data subset ──
print("\n--- Refutation 3: Data Subset ---")
ref3 = model.refute_estimate(
    identified_estimand, estimate,
    method_name="data_subset_refuter",
    subset_fraction=0.8
)
print(ref3)

# ── Refutation 4: Bootstrap ──
print("\n--- Refutation 4: Bootstrap ---")
ref4 = model.refute_estimate(
    identified_estimand, estimate,
    method_name="bootstrap_refuter",
    num_simulations=200
)
print(ref4)

6.2 Interpreting Refutation Results

  Refutation Decision Tree

  Random common cause:
    Estimate changes significantly? ──> YES ──> Possible unmeasured confounding
                                   └── NO  ──> Passes (continue)

  Placebo treatment:
    Placebo effect near zero?       ──> YES ──> Passes (effect is treatment-specific)
                                   └── NO  ──> Specification error likely

  Data subset:
    Estimate within original CI?    ──> YES ──> Passes (stable)
                                   └── NO  ──> Influential observations present

  Bootstrap:
    CI is narrow and covers original?──> YES ──> Passes (precise)
                                    └── NO  ──> High variance or instability
| Refutation | Tests For | Failure Implication |
|------------|-----------|---------------------|
| Random common cause | Sensitivity to unobserved confounders | Model may be missing a key confounder |
| Placebo treatment | Specificity of the treatment | Effect may be an artifact of model specification |
| Data subset | Influence of individual observations | Estimate driven by outliers or a small subgroup |
| Bootstrap | Finite-sample stability | Large variance makes the point estimate unreliable |
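The decision tree can be mechanized into crude pass/fail checks. A minimal sketch — the 10% tolerance is an illustrative assumption, not a DoWhy convention; in practice you would read the original and refuted effects off DoWhy's refuter result objects:

```python
def refutation_passes(original, refuted, kind, rel_tol=0.1):
    """Crude pass/fail rules mirroring the decision tree above.

    original: the original effect estimate
    refuted:  the effect re-estimated by the refuter
    kind:     'random_common_cause' | 'placebo' | 'subset'
    """
    if kind == "placebo":
        # A permuted treatment should produce an effect near zero.
        return abs(refuted) < rel_tol * abs(original)
    # Otherwise the refuted estimate should stay close to the original.
    return abs(refuted - original) < rel_tol * abs(original)

print(refutation_passes(5.0, 4.97, "random_common_cause"))  # stable estimate
print(refutation_passes(5.0, 0.03, "placebo"))              # placebo near zero
print(refutation_passes(5.0, 2.00, "subset"))               # unstable subset
```

Hard-coding a single tolerance is a simplification; a more careful check compares the refuted estimate against the original confidence interval, as the decision tree suggests.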

6.3 The E-Value — Sensitivity to Unmeasured Confounding

The E-value (VanderWeele and Ding, 2017) asks: how strong would an unmeasured confounder need to be — in terms of its associations with both treatment and outcome — to explain away the observed effect?

For a risk ratio $\text{RR}$:

\[E\text{-value} = \text{RR} + \sqrt{\text{RR} \times (\text{RR} - 1)}\]

A large E-value means only a very strong unmeasured confounder could nullify the result.

def compute_e_value(rr):
    """Compute the E-value for a given risk ratio.

    The E-value is the minimum strength of association that an
    unmeasured confounder would need to have with both treatment
    and outcome (on the RR scale) to fully explain away the
    observed effect.

    Args:
        rr: observed risk ratio (must be >= 1; if < 1, invert)

    Returns:
        float: E-value
    """
    if rr < 1:
        rr = 1 / rr
    return rr + np.sqrt(rr * (rr - 1))

# ── Example: convert our ATE to an approximate risk ratio ──
# For a continuous outcome, VanderWeele & Ding suggest the approximate
# conversion RR ≈ exp(0.91 × d), where d = tau / SD(Y) is the
# standardized effect size:
sd_y = obs_data["Y"].std()  # obs_data from the IPW example in Sec. 3
tau_est = 4.0  # our estimated ATE
rr_approx = np.exp(0.91 * tau_est / sd_y)

e_val = compute_e_value(rr_approx)
print(f"Approximate RR: {rr_approx:.2f}")
print(f"E-value: {e_val:.2f}")
print(f"\nInterpretation: An unmeasured confounder would need to be")
print(f"associated with both treatment and outcome by a factor of")
print(f"at least {e_val:.1f} (on the RR scale) to explain away the effect.")
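VanderWeele & Ding also recommend reporting a second E-value for the confidence limit closest to the null: a large point-estimate E-value paired with a small limit E-value signals a fragile result. A minimal sketch with illustrative numbers (not computed from the data above):

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio (inverted if below 1)."""
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

# Report both the point-estimate E-value and the E-value for the
# confidence limit closest to RR = 1 (illustrative numbers).
rr_point, rr_lower = 2.5, 1.6
print(f"E-value (point estimate): {e_value(rr_point):.2f}")
print(f"E-value (CI limit):       {e_value(rr_lower):.2f}")
```

When the CI limit crosses 1, its E-value is 1 by definition: no unmeasured confounding at all is needed to move the interval to the null.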

6.4 Putting It All Together — The Refutation Checklist

Every causal analysis in this series should report the following:

| Checklist Item | Status | Tool |
|----------------|--------|------|
| Propensity score overlap checked | [ ] | Histogram / trimming |
| Covariate balance achieved (SMD < 0.1) | [ ] | `compute_smd()` |
| Random common cause refutation passed | [ ] | DoWhy |
| Placebo treatment refutation passed | [ ] | DoWhy |
| Data subset refutation passed | [ ] | DoWhy |
| E-value reported | [ ] | `compute_e_value()` |
| Pre-trends test passed (if DiD) | [ ] | F-test / event study |

Library Reference

The complete mapping from method to library:

| Method | Library | Key Class / Function | Series Part |
|--------|---------|----------------------|-------------|
| Full pipeline (Model-Identify-Estimate-Refute) | DoWhy | `CausalModel` | 6 (Sec. 1) |
| DiD with TWFE | statsmodels | `smf.ols()` with `C(unit_id)` | 6 (Sec. 2) |
| PPML for counts | statsmodels | `sm.GLM(family=Poisson())` | 6 (Sec. 2) |
| Propensity score estimation | scikit-learn | `LogisticRegression` | 6 (Sec. 3) |
| IPW | Custom / DoWhy | `ipw_ate()` above | 6 (Sec. 3) |
| AIPW (doubly robust) | Custom / DoWhy | `aipw_ate()` above | 6 (Sec. 3) |
| DML + Causal Forests | EconML | `CausalForestDML` | 6 (Sec. 4) |
| Meta-learners (S/T/X) | CausalML | `BaseSRegressor`, `BaseTRegressor`, `BaseXRegressor` | 6 (Sec. 4) |
| PC algorithm | causal-learn | `pc()` | 6 (Sec. 5) |
| GES | causal-learn | `ges()` | 6 (Sec. 5) |
| Refutation tests | DoWhy | `model.refute_estimate()` | 6 (Sec. 6) |
| E-value | Custom | `compute_e_value()` | 6 (Sec. 6) |
  Installation (one command):

  pip install dowhy econml causalml causal-learn statsmodels scikit-learn

Summary

This post translated five posts of theory into working code. The key implementation patterns:

  1. The DoWhy pipeline enforces discipline. Model your assumptions as a graph before estimating anything. Identify the estimand before choosing an estimator. Refute the result before reporting it. The four-step structure is not bureaucracy — it is the difference between a causal claim and a correlation dressed up in causal language.

  2. DiD requires more than a regression coefficient. Cluster standard errors at the unit level. Run an event study to visualize pre-trends. Test parallel trends with a joint F-test. Use PPML for count outcomes.

  3. Always check balance before trusting an IPW or matching estimate. The SMD table is the first diagnostic, not the last. AIPW provides insurance through double robustness — it is consistent if either the propensity model or the outcome model is correct — but nothing guarantees that at least one of them is, so check both.

  4. Causal forests reveal who benefits, not just whether treatment works. Feature importance for heterogeneity is a fundamentally different quantity from predictive feature importance. The former asks “which covariates modify the treatment effect?” — the latter asks “which covariates predict the outcome?”

  5. Refutation is not optional. Every estimate in this post passed four refutation tests plus an E-value assessment. A result without refutation is a claim without evidence.

Part 7 takes the next step: can an LLM-driven agent automate this entire pipeline — from reading a dataset to proposing a causal graph, selecting an estimator, running the code, and interpreting the refutation results?


References

  • Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, 119(1), 249-275.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1), C1-C68.
  • Kunzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156-4165.
  • Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846-866.
  • Santos Silva, J. M. C., & Tenreyro, S. (2006). The log of gravity. Review of Economics and Statistics, 88(4), 641-658.
  • Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216.
  • Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). MIT Press.
  • VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.
  • Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.

This post is licensed under CC BY 4.0 by the author.