Causal Inference Part 6: End-to-End Practice — The Causal Pipeline in Code
Causal Inference — From First Principles to Automated Reasoning
This is Part 6 of an 8-part series.
- Part 0: Beyond Correlation — Why Causal Inference?
- Part 1: Potential Outcomes and the Fundamental Problem
- Part 2: Structural Causal Models and DAGs
- Part 3: Identification — From Assumptions to Estimands
- Part 4: Estimation — From Estimands to Numbers
- Part 5: Beyond the Average — Heterogeneous Effects and Causal Discovery
- Part 6: End-to-End Practice — The Causal Pipeline in Code (this post)
- Part 7: The Causal Inference Agent — When LLMs Meet Causality
Hook: The Bridge Between Theory and Practice
Theory without practice is philosophy. Practice without theory is gambling. This post is the bridge.
Parts 1-5 built a complete mathematical framework: potential outcomes, structural causal models, identification strategies, estimation methods, heterogeneous effects, and causal discovery. Every concept had a formal definition, an intuition, and an assumption list. But none of it ran on a computer.
This post translates every major concept from Parts 1-5 into working Python code that you can copy, run, and adapt to your own data. We implement the full DoWhy pipeline (Model, Identify, Estimate, Refute), build a difference-in-differences analysis with event study plots, run matching and doubly robust estimation with balance diagnostics, fit causal forests for heterogeneous effects, apply causal discovery algorithms, and stress-test every result with refutation.
The code is deliberately self-contained. Each section generates its own synthetic data with a known ground truth, so you can verify that the estimator recovers the true effect before applying it to real data.
1. The DoWhy Pipeline — Model, Identify, Estimate, Refute
The DoWhy library (Sharma & Kiciman, 2020) implements the four-step causal pipeline introduced in Part 0 and used throughout the series. Every causal analysis — regardless of estimator — should pass through these four gates before any conclusion is drawn.
┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐
│ MODEL │───>│ IDENTIFY │───>│ ESTIMATE │───>│ REFUTE │
│ │ │ │ │ │ │ │
│ Declare the │ │ Derive the │ │ Compute the │ │ Stress-test │
│ causal graph │ │ estimand │ │ numerical │ │ the result │
│ (DAG) │ │ from graph │ │ estimate │ │ │
└─────────────┘ └──────────────┘ └──────────────┘ └─────────────┘
Part 2 Part 3 Part 4 Parts 4-6
1.1 Step 1 — Model: Define the Causal Graph
We start with synthetic data where the ground truth is known. The data-generating process is:
- $\mathbf{X}_i \sim \mathcal{N}(0, 1)$ — a single confounder
- $W_i \sim \text{Bernoulli}(\sigma(\mathbf{X}_i))$ — treatment depends on confounder via logistic link
- $Y_i = 2\mathbf{X}_i + 5W_i + \varepsilon_i$ — outcome with true $\tau = 5.0$
import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np
# ── Generate synthetic data with known ground truth ──
np.random.seed(42)
N = 2000
X = np.random.randn(N)
W = (np.random.rand(N) < 1 / (1 + np.exp(-X))).astype(int) # e(x) = sigmoid(x)
Y = 2 * X + 5 * W + np.random.randn(N) # true tau = 5.0
data = pd.DataFrame({"X": X, "W": W, "Y": Y})
# ── Step 1: Model — declare the causal graph ──
model = CausalModel(
data=data,
treatment="W",
outcome="Y",
common_causes=["X"] # X confounds W -> Y
)
# model.view_model() # renders the DAG (requires pygraphviz)
The graph encodes our assumptions, not our data. DoWhy uses this graph to determine which adjustment strategy is valid.
1.2 Step 2 — Identify: Derive the Estimand
DoWhy inspects the graph and returns a symbolic estimand — the mathematical expression that, under the stated assumptions, equals the causal effect.
# ── Step 2: Identify — find the estimand ──
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
# Output (abbreviated):
# Estimand type: EstimandType.NONPARAMETRIC_ATE
# Estimand expression:
# d/dW[ E[Y | X] ]
# Backdoor variables: ['X']
The backdoor criterion from Part 2 tells us: conditioning on $\mathbf{X}$ blocks all confounding paths between $W_i$ and $Y_i$. The estimand is the backdoor adjustment formula:
\[\tau = E_{\mathbf{X}}[E[Y_i \mid W_i = 1, \mathbf{X}_i] - E[Y_i \mid W_i = 0, \mathbf{X}_i]]\]
1.3 Step 3 — Estimate: Compute the Numerical Effect
With the estimand in hand, we choose a statistical method to compute $\hat{\tau}$.
# ── Step 3: Estimate — propensity score stratification ──
estimate_ps = model.estimate_effect(
identified_estimand,
method_name="backdoor.propensity_score_stratification"
)
print(f"Propensity score stratification: {estimate_ps.value:.3f}")
# ── Alternative: linear regression ──
estimate_lr = model.estimate_effect(
identified_estimand,
method_name="backdoor.linear_regression"
)
print(f"Linear regression: {estimate_lr.value:.3f}")
Expected output (approximate):
Propensity score stratification: 4.95
Linear regression: 5.01
Both estimators recover the true $\tau = 5.0$ within sampling noise. The propensity score method uses the balancing property from Part 4; linear regression uses the backdoor adjustment directly.
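To see what the adjustment buys us, here is a minimal NumPy sketch (re-simulating the same data-generating process; no DoWhy required) comparing the naive difference in means against a regression that conditions on X:

```python
import numpy as np

# Same DGP as above (seed 42, N = 2000, true tau = 5.0)
np.random.seed(42)
N = 2000
X = np.random.randn(N)
W = (np.random.rand(N) < 1 / (1 + np.exp(-X))).astype(int)
Y = 2 * X + 5 * W + np.random.randn(N)

# Naive difference in means: ignores the confounder X
tau_naive = Y[W == 1].mean() - Y[W == 0].mean()

# Backdoor adjustment via OLS on [1, W, X]
design = np.column_stack([np.ones(N), W, X])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
tau_adjusted = beta[1]

print(f"Naive difference in means: {tau_naive:.3f}")  # inflated by confounding
print(f"OLS adjusting for X:       {tau_adjusted:.3f}")
```

The naive contrast is biased upward because treated units have systematically higher X; conditioning on X removes that path.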
1.4 Step 4 — Refute: Stress-Test the Result
A result without refutation is a claim without evidence. DoWhy provides four built-in refutation tests:
# ── Step 4: Refute — four tests ──
# (a) Random common cause: add a noise variable as a confounder
refute_random = model.refute_estimate(
identified_estimand, estimate_lr,
method_name="random_common_cause"
)
print(refute_random)
# (b) Placebo treatment: replace W with random noise
refute_placebo = model.refute_estimate(
identified_estimand, estimate_lr,
method_name="placebo_treatment_refuter",
placebo_type="permute"
)
print(refute_placebo)
# (c) Data subset: re-estimate on a random 80% of the data
refute_subset = model.refute_estimate(
identified_estimand, estimate_lr,
method_name="data_subset_refuter",
subset_fraction=0.8
)
print(refute_subset)
# (d) Bootstrap: check stability across resamples
refute_bootstrap = model.refute_estimate(
identified_estimand, estimate_lr,
method_name="bootstrap_refuter",
num_simulations=100
)
print(refute_bootstrap)
| Refutation Test | What It Checks | Pass Criterion |
|---|---|---|
| Random common cause | Robustness to an unobserved confounder | Estimate unchanged (p > 0.05) |
| Placebo treatment | Whether effect is specific to the real treatment | Effect near zero |
| Data subset | Stability across subsamples | Estimate within CI of original |
| Bootstrap | Finite-sample variability | Narrow CI around original estimate |
If any refutation test fails, the causal claim is suspect. The placebo test is especially powerful: if replacing the real treatment with random noise still yields a “significant” effect, the original estimate is likely driven by specification error rather than causation.
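The placebo logic is easy to replicate by hand. In this sketch (same DGP as Section 1) we permute the treatment vector, severing any treatment-outcome link, and re-run the adjusted regression; the coefficient on the shuffled treatment should collapse toward zero:

```python
import numpy as np

# Same DGP as Section 1 (true tau = 5.0)
np.random.seed(42)
N = 2000
X = np.random.randn(N)
W = (np.random.rand(N) < 1 / (1 + np.exp(-X))).astype(int)
Y = 2 * X + 5 * W + np.random.randn(N)

# Placebo: permute W, breaking any treatment -> outcome link
W_placebo = np.random.default_rng(0).permutation(W)
design = np.column_stack([np.ones(N), W_placebo, X])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"Placebo 'effect': {beta[1]:.3f}")  # should be near zero
```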
2. Difference-in-Differences in Practice
Parts 3 and 4 developed the DiD estimator mathematically. Here we implement the full DiD workflow: generate panel data, estimate with two-way fixed effects, run event study diagnostics, and test pre-trends.
2.1 Generate Panel Data
import statsmodels.formula.api as smf
# ── Generate panel data: 100 units x 20 periods ──
np.random.seed(123)
n_units = 100
n_periods = 20
treat_period = 10 # treatment starts at t = 10
n_treated = 50 # first 50 units are treated
delta_true = 3.0 # true DiD effect
rows = []
for i in range(n_units):
alpha_i = np.random.randn() * 2 # unit fixed effect
for t in range(n_periods):
gamma_t = 0.5 * t # time trend (shared)
treated = int(i < n_treated)
post = int(t >= treat_period)
treated_post = treated * post
# Parallel trends hold by construction:
# Y_it = alpha_i + gamma_t + delta * D_it + noise
Y = alpha_i + gamma_t + delta_true * treated_post + np.random.randn()
rows.append({
"unit_id": i, "time": t, "treated": treated,
"post": post, "treated_post": treated_post, "Y": Y
})
panel = pd.DataFrame(rows)
print(f"Panel shape: {panel.shape}")
print(f"Treated units: {panel[panel['treated']==1]['unit_id'].nunique()}")
print(f"Periods: {panel['time'].nunique()}, Treatment at t={treat_period}")
The causal DAG for this design:
Treatment group (alpha_i)
│
▼
Unit FE ──────> Y_it <────── Time FE (gamma_t)
▲
│
delta * treated_post
2.2 Two-Way Fixed Effects OLS
The canonical DiD regression (Part 3, Equation 3.2):
\[Y_{it} = \alpha_i + \gamma_t + \delta \cdot (W_i \times \text{Post}_t) + \varepsilon_{it}\]
where $\alpha_i$ is the unit fixed effect, $\gamma_t$ is the time fixed effect, and $\delta$ is the DiD coefficient.
# ── TWFE DiD with clustered standard errors ──
formula = "Y ~ treated_post + C(unit_id) + C(time)"
did_model = smf.ols(formula, data=panel).fit(
cov_type='cluster',
cov_kwds={'groups': panel['unit_id']}
)
delta_hat = did_model.params['treated_post']
se_hat = did_model.bse['treated_post']
ci_low = delta_hat - 1.96 * se_hat
ci_high = delta_hat + 1.96 * se_hat
print(f"DiD estimate (delta): {delta_hat:.3f}")
print(f"Standard error (clustered): {se_hat:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"True delta: {delta_true}")
Pitfall: Standard errors must be clustered at the unit level (Bertrand, Duflo, and Mullainathan, 2004). Unclustered standard errors dramatically understate uncertainty because outcomes within a unit are serially correlated.
2.3 PPML for Count Outcomes
When $Y_{it}$ is a count (e.g., number of publications, patent counts), OLS on levels or log-transformed outcomes is problematic. PPML (Santos Silva and Tenreyro, 2006) handles zeros, is consistent under heteroskedasticity, and directly estimates multiplicative effects.
import statsmodels.api as sm
# ── Generate count outcome ──
np.random.seed(456)
panel["Y_count"] = np.random.poisson(
lam=np.exp(0.5 + 0.3 * panel["treated_post"] + 0.02 * panel["time"])
)
# ── PPML (Poisson pseudo-maximum likelihood) ──
X_ppml = pd.get_dummies(panel[["treated_post", "time"]], columns=["time"], drop_first=True).astype(float)  # cast bool dummies to float for GLM
X_ppml = sm.add_constant(X_ppml)
ppml_model = sm.GLM(
panel["Y_count"],
X_ppml,
family=sm.families.Poisson()
).fit()
print(f"PPML coefficient on treated_post: {ppml_model.params['treated_post']:.4f}")
print(f"Implied multiplicative effect: {np.exp(ppml_model.params['treated_post']):.4f}")
| Method | Outcome Type | Handles Zeros | Consistent Under Heteroskedasticity |
|---|---|---|---|
| OLS on levels | Continuous | Yes | No (efficiency loss) |
| OLS on log(Y) | Continuous, Y > 0 | No | No |
| PPML | Counts or continuous | Yes | Yes |
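To make the PPML mechanics concrete, here is a hand-rolled Poisson regression via Newton's method in plain NumPy, fit on a simplified cross-sectional stand-in for the count DGP above (the coefficients 0.5, 0.3, 0.02 mirror the simulation; this is an illustrative sketch, not a replacement for statsmodels):

```python
import numpy as np

# Simplified cross-sectional stand-in for the count DGP above
np.random.seed(456)
n = 2000
D = np.random.binomial(1, 0.5, n)        # treatment indicator
t = np.random.randint(0, 20, n)          # time-period covariate
y = np.random.poisson(np.exp(0.5 + 0.3 * D + 0.02 * t))

Xd = np.column_stack([np.ones(n), D, t]).astype(float)

# Newton-Raphson for the Poisson log-likelihood with log link
beta = np.zeros(Xd.shape[1])
for _ in range(50):
    mu = np.exp(Xd @ beta)
    grad = Xd.T @ (y - mu)               # score vector
    hess = Xd.T @ (Xd * mu[:, None])     # Fisher information
    step = np.linalg.solve(hess, grad)
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

print(f"PPML coefficient on D: {beta[1]:.3f}")
print(f"Implied multiplicative effect: {np.exp(beta[1]):.3f}")
```

The exponentiated coefficient is the multiplicative effect on the expected count, the same quantity the statsmodels GLM reports above.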
2.4 Event Study — Testing Pre-Trends
The parallel trends assumption is untestable (Part 3), but we can probe it by estimating period-specific effects $\delta_k$ for each period relative to the treatment:
\[Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \delta_k \cdot W_i \cdot \mathbf{1}[t - t^* = k] + \varepsilon_{it}\]
where $t^*$ is the treatment period and $k = -1$ is the reference period.
# ── Event study: estimate delta_k for each relative period ──
panel["rel_time"] = panel["time"] - treat_period
# Create interaction dummies (exclude k = -1 as reference)
for k in range(-(treat_period), n_periods - treat_period):
if k == -1:
continue # reference period
col_name = f"rel_{k}" if k < 0 else f"rel_p{k}"
panel[col_name] = (panel["treated"] * (panel["rel_time"] == k)).astype(int)
# Build formula: Y ~ sum of rel_k dummies + unit FE + time FE
rel_cols = [c for c in panel.columns if c.startswith("rel_")]
formula_es = "Y ~ " + " + ".join(rel_cols) + " + C(unit_id) + C(time)"
es_model = smf.ols(formula_es, data=panel).fit(
cov_type='cluster',
cov_kwds={'groups': panel['unit_id']}
)
# ── Extract event study coefficients ──
es_results = []
for col in sorted(rel_cols):
k_val = col.replace("rel_p", "").replace("rel_", "")
k_int = int(k_val)
coef = es_model.params.get(col, 0)
se = es_model.bse.get(col, 0)
es_results.append({"k": k_int, "coef": coef, "se": se})
es_df = pd.DataFrame(es_results).sort_values("k")
# Add the reference period (k = -1, coef = 0)
es_df = pd.concat([es_df, pd.DataFrame([{"k": -1, "coef": 0, "se": 0}])])
es_df = es_df.sort_values("k").reset_index(drop=True)
print("Event Study Coefficients:")
print(es_df.to_string(index=False))
If the pre-treatment coefficients ($\delta_k$ for $k < -1$) are jointly indistinguishable from zero, parallel trends is supported. A visual check:
Event Study Plot (conceptual)
delta_k
| * * * * * * * * * * <- post-treatment: ~3.0
3 |. . . . . . . . . . . . . . . *. . . . . .
| /
2 | /
| /
1 | |
| |
0 |--*--*--*--*--*--*--*--*--+------------------
| |
-1 | |
+--+--+--+--+--+--+--+--+-+--+--+--+--+--+--
-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 ...
↑
treatment onset
Pre-trends: flat near zero Post-treatment: jumps to delta
2.5 Pre-Trends F-Test
import scipy.stats as stats
# ── Joint F-test for pre-treatment coefficients ──
pre_cols = [c for c in rel_cols
            if not c.startswith("rel_p") and c in es_model.params]
# Negative relative periods only ("rel_-k"); post-period dummies start with "rel_p"
pre_coefs = [es_model.params[c] for c in pre_cols]
pre_cov = es_model.cov_params().loc[pre_cols, pre_cols]
if len(pre_coefs) > 0:
pre_coefs_arr = np.array(pre_coefs)
pre_cov_arr = pre_cov.values
# Wald test: H0: all pre-treatment coefficients = 0
try:
wald_stat = pre_coefs_arr @ np.linalg.inv(pre_cov_arr) @ pre_coefs_arr
p_value = 1 - stats.chi2.cdf(wald_stat, df=len(pre_coefs))
print(f"Pre-trends Wald test: chi2 = {wald_stat:.2f}, "
f"df = {len(pre_coefs)}, p = {p_value:.4f}")
print("Parallel trends supported" if p_value > 0.05
else "WARNING: pre-trends detected")
except np.linalg.LinAlgError:
print("Covariance matrix singular — check collinearity")
3. Matching, IPW, and Doubly Robust in Practice
Part 4 developed matching, inverse probability weighting (IPW), and the doubly robust estimator (AIPW) theoretically. Here we implement all three, starting from propensity score estimation and ending with the doubly robust estimator, which is consistent if either the propensity model or the outcome model is correctly specified.
3.1 Propensity Score Estimation
The propensity score $e(\mathbf{x}) = P(W_i = 1 \mid \mathbf{X}_i = \mathbf{x})$ is the central object linking all three methods. We estimate it with logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# ── Generate richer confounded data ──
np.random.seed(789)
N = 3000
X1 = np.random.randn(N)
X2 = np.random.randn(N)
X3 = np.random.randn(N)
X_mat = np.column_stack([X1, X2, X3])
# Treatment depends on X1 and X2
logit = 0.5 * X1 + 0.3 * X2
e_true = 1 / (1 + np.exp(-logit))
W = (np.random.rand(N) < e_true).astype(int)
# Outcome depends on X1, X3, and treatment (true tau = 4.0)
Y = 1.5 * X1 + 0.8 * X3 + 4.0 * W + np.random.randn(N)
obs_data = pd.DataFrame({
"X1": X1, "X2": X2, "X3": X3, "W": W, "Y": Y
})
# ── Estimate propensity score ──
scaler = StandardScaler()
X_scaled = scaler.fit_transform(obs_data[["X1", "X2", "X3"]])
ps_model = LogisticRegression(max_iter=1000, random_state=42)
ps_model.fit(X_scaled, obs_data["W"])
obs_data["ps"] = ps_model.predict_proba(X_scaled)[:, 1]
print(f"Propensity score range: [{obs_data['ps'].min():.3f}, {obs_data['ps'].max():.3f}]")
print(f"Mean PS (treated): {obs_data.loc[obs_data['W']==1, 'ps'].mean():.3f}")
print(f"Mean PS (control): {obs_data.loc[obs_data['W']==0, 'ps'].mean():.3f}")
3.2 Balance Diagnostics — Standardized Mean Differences
Before estimating anything, check covariate balance. The standardized mean difference (SMD) quantifies imbalance: values above 0.1 are traditionally considered problematic.
\[\text{SMD}_j = \frac{\bar{X}_{j,\text{treated}} - \bar{X}_{j,\text{control}}}{\sqrt{(s^2_{j,\text{treated}} + s^2_{j,\text{control}}) / 2}}\]
def compute_smd(data, covariates, treatment_col, weight_col=None):
"""Compute standardized mean differences for each covariate.
Args:
data: DataFrame with covariates and treatment
covariates: list of covariate column names
treatment_col: name of binary treatment column
weight_col: optional IPW weight column
Returns:
DataFrame with SMD for each covariate
"""
treated = data[data[treatment_col] == 1]
control = data[data[treatment_col] == 0]
results = []
for cov in covariates:
if weight_col is not None:
mean_t = np.average(treated[cov], weights=treated[weight_col])
mean_c = np.average(control[cov], weights=control[weight_col])
else:
mean_t = treated[cov].mean()
mean_c = control[cov].mean()
var_t = treated[cov].var()
var_c = control[cov].var()
smd = (mean_t - mean_c) / np.sqrt((var_t + var_c) / 2)
results.append({"covariate": cov, "mean_treated": mean_t,
"mean_control": mean_c, "SMD": smd})
return pd.DataFrame(results)
# ── Unadjusted balance ──
covariates = ["X1", "X2", "X3"]
balance_raw = compute_smd(obs_data, covariates, "W")
print("=== Unadjusted Balance ===")
print(balance_raw.to_string(index=False))
Expected output (approximate):
=== Unadjusted Balance ===
covariate mean_treated mean_control SMD
X1 0.33 -0.33 0.66 <-- confounded
X2 0.19 -0.18 0.37 <-- confounded
X3 0.01 -0.01 0.02 <-- balanced (not a confounder of W)
X1 and X2 are imbalanced (SMD > 0.1) because they drive treatment assignment. X3 is balanced because it does not affect treatment.
Conceptual Balance Plot (Love Plot)
Covariate |---- Unadjusted ----|---- IPW-Adjusted ----|
<---0.1---0---0.1--->
X1 * o
X2 * o
X3 *o
* = unadjusted SMD o = IPW-adjusted SMD
Goal: all |SMD| < 0.1 after adjustment
3.3 IPW Estimation
The IPW estimator (Horvitz-Thompson, Part 4) re-weights each observation by the inverse of its propensity score:
\[\hat{\tau}_{\text{IPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{W_i Y_i}{\hat{e}(\mathbf{x}_i)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(\mathbf{x}_i)} \right]\]
# ── IPW estimator ──
def ipw_ate(data, outcome_col, treatment_col, ps_col, trim=0.01):
"""Compute ATE via inverse probability weighting.
Args:
data: DataFrame
outcome_col: outcome column name
treatment_col: treatment column name
ps_col: propensity score column name
trim: trim extreme propensity scores for stability
Returns:
float: estimated ATE
"""
df = data.copy()
# Trim extreme propensity scores to avoid division by near-zero
df[ps_col] = df[ps_col].clip(lower=trim, upper=1 - trim)
W = df[treatment_col].values
Y = df[outcome_col].values
ps = df[ps_col].values
# Horvitz-Thompson estimator
tau_hat = np.mean(W * Y / ps - (1 - W) * Y / (1 - ps))
return tau_hat
tau_ipw = ipw_ate(obs_data, "Y", "W", "ps")
print(f"IPW estimate: {tau_ipw:.3f} (true tau = 4.0)")
# ── Check balance after IPW ──
obs_data["ipw_weight"] = np.where(
obs_data["W"] == 1,
1 / obs_data["ps"].clip(0.01, 0.99),
1 / (1 - obs_data["ps"].clip(0.01, 0.99))
)
balance_ipw = compute_smd(obs_data, covariates, "W", weight_col="ipw_weight")
print("\n=== IPW-Adjusted Balance ===")
print(balance_ipw.to_string(index=False))
3.4 AIPW — The Doubly Robust Estimator
The AIPW estimator (Robins, Rotnitzky, and Zhao, 1994) combines an outcome model $\hat{\mu}_w(\mathbf{x})$ with the propensity score $\hat{e}(\mathbf{x})$:
\[\hat{\tau}_{\text{AIPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(\mathbf{x}_i) - \hat{\mu}_0(\mathbf{x}_i) + \frac{W_i(Y_i - \hat{\mu}_1(\mathbf{x}_i))}{\hat{e}(\mathbf{x}_i)} - \frac{(1-W_i)(Y_i - \hat{\mu}_0(\mathbf{x}_i))}{1 - \hat{e}(\mathbf{x}_i)} \right]\]
AIPW is consistent if either the propensity model or the outcome model is correctly specified — hence “doubly robust.”
from sklearn.linear_model import LinearRegression
def aipw_ate(data, outcome_col, treatment_col, covariates, ps_col, trim=0.01):
"""Compute ATE via augmented inverse probability weighting (AIPW).
Args:
data: DataFrame
outcome_col, treatment_col: column names
covariates: list of covariate column names
ps_col: propensity score column name
trim: propensity score trimming threshold
Returns:
float: estimated ATE
"""
df = data.copy()
df[ps_col] = df[ps_col].clip(lower=trim, upper=1 - trim)
W = df[treatment_col].values
Y = df[outcome_col].values
X = df[covariates].values
ps = df[ps_col].values
# Fit outcome models mu_1(x) and mu_0(x)
treated_idx = W == 1
control_idx = W == 0
model_1 = LinearRegression().fit(X[treated_idx], Y[treated_idx])
model_0 = LinearRegression().fit(X[control_idx], Y[control_idx])
mu1_hat = model_1.predict(X)
mu0_hat = model_0.predict(X)
# AIPW formula
tau_hat = np.mean(
(mu1_hat - mu0_hat)
+ W * (Y - mu1_hat) / ps
- (1 - W) * (Y - mu0_hat) / (1 - ps)
)
return tau_hat
tau_aipw = aipw_ate(obs_data, "Y", "W", covariates, "ps")
print(f"AIPW estimate: {tau_aipw:.3f} (true tau = 4.0)")
| Estimator | What It Requires | Double Robustness | Variance |
|---|---|---|---|
| Regression adjustment | Correct outcome model $\hat{\mu}_w$ | No | Low if model correct |
| IPW | Correct propensity model $\hat{e}$ | No | Can be high (extreme weights) |
| AIPW | Both models, but only one needs to be correct | Yes | Semiparametric efficient |
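The double-robustness claim in the table can be checked numerically. In the sketch below (re-simulated data; the true propensity score is plugged in directly so the experiment isolates outcome-model error), the outcome models deliberately omit the confounder X1: plain regression adjustment inherits the omitted-variable bias, while AIPW with a correct propensity score still lands near the true $\tau = 4.0$:

```python
import numpy as np

# Re-simulate Section 3's DGP (true tau = 4.0); use the TRUE propensity
# score so the demonstration isolates outcome-model misspecification.
np.random.seed(789)
N = 3000
X1, X2, X3 = np.random.randn(3, N)
e_true = 1 / (1 + np.exp(-(0.5 * X1 + 0.3 * X2)))
W = (np.random.rand(N) < e_true).astype(int)
Y = 1.5 * X1 + 0.8 * X3 + 4.0 * W + np.random.randn(N)

def fit_mu(mask):
    """Deliberately MISSPECIFIED outcome model: regress Y on X3 only,
    omitting the confounder X1. Returns predictions for all units."""
    A = np.column_stack([np.ones(mask.sum()), X3[mask]])
    beta, *_ = np.linalg.lstsq(A, Y[mask], rcond=None)
    return beta[0] + beta[1] * X3

mu1, mu0 = fit_mu(W == 1), fit_mu(W == 0)

# Plain regression adjustment inherits the omitted-variable bias
tau_reg = np.mean(mu1 - mu0)

# AIPW: the weighted residual terms repair the broken outcome model
tau_aipw = np.mean(
    (mu1 - mu0)
    + W * (Y - mu1) / e_true
    - (1 - W) * (Y - mu0) / (1 - e_true)
)
print(f"Regression adjustment (bad mu-model): {tau_reg:.3f}")
print(f"AIPW (bad mu-model, correct e):       {tau_aipw:.3f}")
```

The symmetric experiment — a correct outcome model with a broken propensity model — gives the mirror-image result, which is exactly what double robustness promises.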
4. DML and Causal Forests in Practice
Double/Debiased Machine Learning (Chernozhukov et al., 2018) allows any ML model for nuisance estimation while maintaining valid inference for the causal parameter. Causal forests (Wager and Athey, 2018) extend this to heterogeneous effects.
4.1 CausalForestDML — End-to-End
EconML’s CausalForestDML combines DML’s cross-fitting with a causal forest for CATE estimation:
from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
# ── Generate data with heterogeneous effects ──
np.random.seed(2026)
N = 5000
X1 = np.random.randn(N)
X2 = np.random.randn(N)
X3 = np.random.uniform(0, 1, N)
X_features = np.column_stack([X1, X2, X3])
# Treatment: depends on X1
T = (np.random.rand(N) < 1 / (1 + np.exp(-0.5 * X1))).astype(int)
# True CATE: tau(x) = 3 + 2*X1 (heterogeneous in X1, not X2 or X3)
tau_true = 3 + 2 * X1
Y = 1.0 * X1 + 0.5 * X2 + tau_true * T + np.random.randn(N)
print(f"True ATE: {tau_true.mean():.3f}")
print(f"True CATE range: [{tau_true.min():.2f}, {tau_true.max():.2f}]")
# ── Fit CausalForestDML ──
est = CausalForestDML(
model_y=GradientBoostingRegressor(
n_estimators=100, max_depth=4, random_state=42
),
model_t=GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42
),
n_estimators=200,
min_samples_leaf=10,
random_state=42
)
est.fit(Y, T, X=X_features)
# ── ATE with confidence interval ──
ate_point = est.ate(X=X_features)
ate_lb, ate_ub = est.ate_interval(X=X_features)
print(f"Estimated ATE: {ate_point:.3f}")
print(f"ATE 95% CI: [{ate_lb:.3f}, {ate_ub:.3f}]")
# ── CATE for individual units ──
cate_hat = est.effect(X_features)
print(f"\nCATE summary:")
print(f" Mean: {cate_hat.mean():.3f}")
print(f" Std: {cate_hat.std():.3f}")
print(f" Range: [{cate_hat.min():.2f}, {cate_hat.max():.2f}]")
4.2 CATE Inspection by Subgroup
The practical payoff of heterogeneous effect estimation: identify who benefits most from treatment.
# ── CATE by subgroup: split on X1 (the true effect modifier) ──
X1_quantiles = pd.qcut(X1, q=5, labels=["Q1 (low)", "Q2", "Q3", "Q4", "Q5 (high)"])
cate_by_group = pd.DataFrame({
"X1_quantile": X1_quantiles,
"CATE_hat": cate_hat,
"CATE_true": tau_true
})
summary = cate_by_group.groupby("X1_quantile").agg(
CATE_estimated=("CATE_hat", "mean"),
CATE_true=("CATE_true", "mean"),
n=("CATE_hat", "count")
).round(3)
print("\n=== CATE by X1 Quantile ===")
print(summary.to_string())
Expected output (approximate):
=== CATE by X1 Quantile ===
CATE_estimated CATE_true n
X1_quantile
Q1 (low) 0.35 0.36 1000
Q2 1.78 1.78 1000
Q3 2.97 3.00 1000
Q4 4.18 4.22 1000
Q5 (high) 5.55 5.64 1000
The causal forest recovers the monotonic relationship between X1 and the treatment effect. Units in Q5 benefit roughly 15 times more than units in Q1.
4.3 Feature Importance for Heterogeneity
# ── Which features drive heterogeneity? ──
feature_names = ["X1", "X2", "X3"]
importances = est.feature_importances_
print("\n=== Feature Importance for Treatment Effect Heterogeneity ===")
for name, imp in sorted(zip(feature_names, importances),
key=lambda x: -x[1]):
bar = "#" * int(imp * 50)
print(f" {name:4s}: {imp:.3f} {bar}")
Expected output:
=== Feature Importance for Treatment Effect Heterogeneity ===
X1 : 0.82 #########################################
X2 : 0.10 #####
X3 : 0.08 ####
X1 dominates because it is the only true effect modifier. X2 and X3 receive small importance from noise-fitting.
4.4 Meta-Learners via CausalML
For comparison, we can estimate CATE with meta-learners (Kunzel et al., 2019). The S-learner, T-learner, and X-learner offer different bias-variance tradeoffs.
from causalml.inference.meta import (
BaseSRegressor,
BaseTRegressor,
BaseXRegressor
)
from sklearn.ensemble import GradientBoostingRegressor as GBR
# ── S-Learner: single model with treatment as a feature ──
s_learner = BaseSRegressor(learner=GBR(n_estimators=100, random_state=42))
cate_s = s_learner.fit_predict(X=X_features, treatment=T, y=Y).flatten()
# ── T-Learner: separate models for treated and control ──
t_learner = BaseTRegressor(learner=GBR(n_estimators=100, random_state=42))
cate_t = t_learner.fit_predict(X=X_features, treatment=T, y=Y).flatten()
# ── X-Learner: impute counterfactuals, then regress ──
x_learner = BaseXRegressor(learner=GBR(n_estimators=100, random_state=42))
cate_x = x_learner.fit_predict(X=X_features, treatment=T, y=Y).flatten()
print("=== Meta-Learner ATE Comparison ===")
print(f" S-Learner ATE: {cate_s.mean():.3f}")
print(f" T-Learner ATE: {cate_t.mean():.3f}")
print(f" X-Learner ATE: {cate_x.mean():.3f}")
print(f" CausalForest ATE: {cate_hat.mean():.3f}")
print(f" True ATE: {tau_true.mean():.3f}")
| Meta-Learner | Strategy | Best When |
|---|---|---|
| S-Learner | Single model, treatment as feature | Large N, small heterogeneity |
| T-Learner | Separate models per treatment arm | Balanced groups, distinct response surfaces |
| X-Learner | Cross-impute counterfactuals | Unequal group sizes |
| Causal Forest | Honest splitting on treatment effect | Moderate to large N, tree-compatible heterogeneity |
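As a sanity check on what the meta-learners are doing, a hand-rolled T-learner with one linear outcome model per arm recovers this section's linear CATE (a sketch on a re-simulated version of the DGP, dropping the irrelevant X3 for brevity; the real learners above use gradient boosting):

```python
import numpy as np

# Re-simulated version of this section's DGP (true CATE = 3 + 2*X1)
np.random.seed(2026)
N = 5000
X1 = np.random.randn(N)
X2 = np.random.randn(N)
T = (np.random.rand(N) < 1 / (1 + np.exp(-0.5 * X1))).astype(int)
tau_true = 3 + 2 * X1
Y = 1.0 * X1 + 0.5 * X2 + tau_true * T + np.random.randn(N)

def fit_arm(mask):
    """One linear outcome model per treatment arm (the T-learner idea)."""
    A = np.column_stack([np.ones(mask.sum()), X1[mask], X2[mask]])
    beta, *_ = np.linalg.lstsq(A, Y[mask], rcond=None)
    return np.column_stack([np.ones(N), X1, X2]) @ beta

cate_hand = fit_arm(T == 1) - fit_arm(T == 0)
print(f"Hand-rolled T-learner ATE: {cate_hand.mean():.3f} (true ~ 3.0)")
print(f"CATE slope in X1:          {np.polyfit(X1, cate_hand, 1)[0]:.3f} (true 2.0)")
```

Because the true response surfaces here are linear in each arm, the linear T-learner is exactly correctly specified; with nonlinear heterogeneity, the boosted learners above are the safer choice.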
5. Causal Discovery in Practice
Parts 2 and 5 introduced causal discovery — learning the DAG from data. Here we apply the PC algorithm and compare the learned graph to a known ground truth.
5.1 The PC Algorithm
The PC algorithm (Spirtes, Glymour, and Scheines, 2000) starts with a complete undirected graph and removes edges via conditional independence tests.
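One such test can be sketched directly. For a pure chain X1 → X2 → X3 (a hypothetical, slightly simpler graph than the one used below), the marginal correlation of X1 and X3 is strong, but their partial correlation given X2 vanishes — exactly the signal PC uses to delete the X1–X3 edge:

```python
import numpy as np

# Hypothetical chain X1 -> X2 -> X3 (no direct X1 -> X3 edge)
np.random.seed(7)
n = 2000
x1 = np.random.randn(n)
x2 = 0.8 * x1 + 0.5 * np.random.randn(n)
x3 = 0.6 * x2 + 0.5 * np.random.randn(n)

def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out `given`."""
    A = np.column_stack([np.ones(len(given)), given])
    res_a = a - A @ np.linalg.lstsq(A, a, rcond=None)[0]
    res_b = b - A @ np.linalg.lstsq(A, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]

r_marginal = np.corrcoef(x1, x3)[0, 1]
r_partial = partial_corr(x1, x3, x2)
print(f"corr(X1, X3)      = {r_marginal:.3f}")  # strong: edge survives so far
print(f"corr(X1, X3 | X2) = {r_partial:.3f}")   # near zero: edge is deleted
```

The Fisher-z test used by causal-learn converts such partial correlations into p-values; this sketch just shows the quantity being tested.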
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz
# ── Generate data from a known DAG: X1 -> X2 -> X3, X1 -> X3 ──
np.random.seed(42)
N_disc = 2000
X1_disc = np.random.randn(N_disc)
X2_disc = 0.8 * X1_disc + np.random.randn(N_disc) * 0.5
X3_disc = 0.5 * X1_disc + 0.6 * X2_disc + np.random.randn(N_disc) * 0.5
data_disc = np.column_stack([X1_disc, X2_disc, X3_disc])
labels = ["X1", "X2", "X3"]
# ── Run PC algorithm ──
cg = pc(data_disc, alpha=0.05, indep_test=fisherz, node_names=labels)
# ── Print adjacency matrix ──
adj = cg.G.graph
print("Adjacency matrix (PC algorithm output):")
print(f"Labels: {labels}")
print(adj)
The output adjacency matrix uses causal-learn's endpoint encoding:
- `G.graph[j, i] = 1` and `G.graph[i, j] = -1`: directed edge $i \to j$
- `G.graph[i, j] = G.graph[j, i] = -1`: undirected edge, left unoriented within the Markov equivalence class
True DAG vs. PC Output (conceptual)
True DAG:                 PC Output (CPDAG):
X1 ──> X2                 X1 ─── X2
 │      │                  │      │
 │      ▼                  │      │
 └────> X3                 └───── X3
No conditional independences hold in this fully connected
triangle, so PC correctly keeps all three edges. But a complete
triangle contains no unshielded collider (v-structure), so the
CPDAG leaves every edge unoriented: the skeleton is recovered,
while the edge directions remain ambiguous.
5.2 Comparing Discovery Algorithms
from causallearn.search.ScoreBased.GES import ges
# ── GES (Greedy Equivalence Search) ──
record = ges(data_disc, score_func='local_score_BIC')
adj_ges = record['G'].graph
print("\nGES adjacency matrix:")
print(adj_ges)
# ── Compare: count edge agreements ──
def compare_adjacency(adj1, adj2, labels):
"""Compare two adjacency matrices and report matches."""
n = len(labels)
agree = 0
disagree = 0
for i in range(n):
for j in range(i + 1, n):
edge1 = (adj1[i, j] != 0) or (adj1[j, i] != 0)
edge2 = (adj2[i, j] != 0) or (adj2[j, i] != 0)
if edge1 == edge2:
agree += 1
else:
disagree += 1
print(f" Disagreement: {labels[i]} -- {labels[j]}: "
f"PC={edge1}, GES={edge2}")
print(f"Edge agreement: {agree}/{agree + disagree}")
compare_adjacency(adj, adj_ges, labels)
| Algorithm | Type | Handles Latent Confounders | Output |
|---|---|---|---|
| PC | Constraint-based | No (assumes causal sufficiency) | CPDAG |
| FCI | Constraint-based | Yes | PAG |
| GES | Score-based (BIC) | No | CPDAG |
Pitfall: Causal discovery algorithms return Markov equivalence classes (CPDAGs), not unique DAGs. Multiple DAGs can encode the same conditional independence relations. The orientation of some edges remains ambiguous without additional assumptions (e.g., non-Gaussianity for LiNGAM, temporal ordering).
6. Refutation and Sensitivity Analysis
No estimate is credible without stress-testing. This section implements a comprehensive refutation workflow and introduces the E-value for sensitivity to unmeasured confounding.
6.1 DoWhy Refutation — Full Walkthrough
We return to the DoWhy model from Section 1 and run all four refuters with interpretation.
# ── Rebuild the DoWhy model (from Section 1) ──
np.random.seed(42)
N = 2000
X = np.random.randn(N)
W = (np.random.rand(N) < 1 / (1 + np.exp(-X))).astype(int)
Y = 2 * X + 5 * W + np.random.randn(N)
data = pd.DataFrame({"X": X, "W": W, "Y": Y})
model = CausalModel(data=data, treatment="W", outcome="Y",
common_causes=["X"])
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
identified_estimand,
method_name="backdoor.linear_regression"
)
print(f"Original estimate: {estimate.value:.3f}")
# ── Refutation 1: Random common cause ──
print("\n--- Refutation 1: Random Common Cause ---")
ref1 = model.refute_estimate(
identified_estimand, estimate,
method_name="random_common_cause"
)
print(ref1)
# ── Refutation 2: Placebo treatment ──
print("\n--- Refutation 2: Placebo Treatment ---")
ref2 = model.refute_estimate(
identified_estimand, estimate,
method_name="placebo_treatment_refuter",
placebo_type="permute"
)
print(ref2)
# ── Refutation 3: Data subset ──
print("\n--- Refutation 3: Data Subset ---")
ref3 = model.refute_estimate(
identified_estimand, estimate,
method_name="data_subset_refuter",
subset_fraction=0.8
)
print(ref3)
# ── Refutation 4: Bootstrap ──
print("\n--- Refutation 4: Bootstrap ---")
ref4 = model.refute_estimate(
identified_estimand, estimate,
method_name="bootstrap_refuter",
num_simulations=200
)
print(ref4)
6.2 Interpreting Refutation Results
```text
Refutation Decision Tree

Random common cause:
  Estimate changes significantly?   ──> YES ──> Possible unmeasured confounding
                                    └── NO ──> Passes (continue)

Placebo treatment:
  Placebo effect near zero?         ──> YES ──> Passes (effect is treatment-specific)
                                    └── NO ──> Specification error likely

Data subset:
  Estimate within original CI?      ──> YES ──> Passes (stable)
                                    └── NO ──> Influential observations present

Bootstrap:
  CI is narrow and covers original? ──> YES ──> Passes (precise)
                                    └── NO ──> High variance or instability
```
| Refutation | Tests For | Failure Implication |
|---|---|---|
| Random common cause | Sensitivity to unobserved confounders | Model may be missing a key confounder |
| Placebo treatment | Specificity of the treatment | Effect may be an artifact of model specification |
| Data subset | Influence of individual observations | Estimate driven by outliers or small subgroup |
| Bootstrap | Finite-sample stability | Large variance makes point estimate unreliable |
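The decision logic above can be sketched as a small helper. This is a hypothetical utility, not part of DoWhy: it assumes you have already extracted the original and refuted effect values from the refuter output, and the 10% tolerance is an illustrative default, not a standard threshold.

```python
def refutation_passes(refuter: str, original: float, refuted: float,
                      tol: float = 0.10) -> bool:
    """Apply the decision tree: does a refutation pass?

    For the placebo refuter, the refuted effect should be near zero;
    for the other refuters, it should stay close to the original.
    `tol` is the allowed relative deviation (illustrative default).
    """
    if refuter == "placebo":
        # Placebo effect should be small relative to the original effect.
        return abs(refuted) < tol * abs(original)
    # Random common cause / data subset: estimate should be stable.
    return abs(refuted - original) < tol * abs(original)

# Examples with the Section 1 ground truth (tau = 5):
print(refutation_passes("random_common_cause", 5.0, 5.02))  # stable -> True
print(refutation_passes("placebo", 5.0, 0.03))              # near zero -> True
print(refutation_passes("placebo", 5.0, 2.5))               # not near zero -> False
```

In practice you would read `original` and `refuted` off the DoWhy refutation result before applying a rule like this; DoWhy's own printed output already reports a p-value for the same comparison.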
6.3 The E-Value — Sensitivity to Unmeasured Confounding
The E-value (VanderWeele and Ding, 2017) asks: how strong would an unmeasured confounder need to be — in terms of its associations with both treatment and outcome — to explain away the observed effect?
For a risk ratio $\text{RR}$:
\[E\text{-value} = \text{RR} + \sqrt{\text{RR} \times (\text{RR} - 1)}\]

A large E-value means only a very strong unmeasured confounder could nullify the result.
```python
import numpy as np

def compute_e_value(rr):
    """Compute the E-value for a given risk ratio.

    The E-value is the minimum strength of association that an
    unmeasured confounder would need to have with both treatment
    and outcome (on the RR scale) to fully explain away the
    observed effect.

    Args:
        rr: observed risk ratio (if < 1, it is inverted first)

    Returns:
        float: E-value
    """
    if rr < 1:
        rr = 1 / rr
    return rr + np.sqrt(rr * (rr - 1))

# ── Example: convert our ATE to an approximate risk ratio ──
# For a continuous outcome, use the approximate conversion RR ≈ exp(tau / SD(Y)).
# `obs_data` is the observational dataset from Section 3.
sd_y = obs_data["Y"].std()
tau_est = 4.0  # our estimated ATE
rr_approx = np.exp(tau_est / sd_y)
e_val = compute_e_value(rr_approx)
print(f"Approximate RR: {rr_approx:.2f}")
print(f"E-value: {e_val:.2f}")
print("\nInterpretation: An unmeasured confounder would need to be")
print("associated with both treatment and outcome by a factor of")
print(f"at least {e_val:.1f} (on the RR scale) to explain away the effect.")
```
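VanderWeele and Ding also recommend reporting a second E-value for the confidence-interval limit closest to the null, since a confounder strong enough to pull the CI across RR = 1 already erases statistical significance. A minimal sketch, assuming the observed RR exceeds 1 so the lower limit is the relevant one (the function name is ours, not from any library):

```python
import math

def e_value_ci(rr_lower: float) -> float:
    """E-value for the lower CI limit when the observed RR > 1.

    If the interval already includes the null (lower limit <= 1),
    no unmeasured confounding is needed to explain away
    significance, so the E-value is 1.
    """
    if rr_lower <= 1:
        return 1.0
    return rr_lower + math.sqrt(rr_lower * (rr_lower - 1))

print(f"{e_value_ci(1.8):.2f}")   # 1.8 + sqrt(1.8 * 0.8) = 3.00
print(e_value_ci(0.95))           # CI crosses the null -> 1.0
```

Reporting both numbers separates "how robust is the point estimate?" from "how robust is the significance?".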
6.4 Putting It All Together — The Refutation Checklist
Every causal analysis in this series should report the following:
| Checklist Item | Status | Tool |
|---|---|---|
| Propensity score overlap checked | [ ] | Histogram / trimming |
| Covariate balance achieved (SMD < 0.1) | [ ] | compute_smd() |
| Random common cause refutation passed | [ ] | DoWhy |
| Placebo treatment refutation passed | [ ] | DoWhy |
| Data subset refutation passed | [ ] | DoWhy |
| E-value reported | [ ] | compute_e_value() |
| Pre-trends test passed (if DiD) | [ ] | F-test / event study |
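To make the checklist part of the analysis output rather than an afterthought, it can be rendered programmatically. This is a hypothetical helper of our own (the item names mirror the table above); in a real project you would populate the status dict from the diagnostics themselves.

```python
CHECKLIST = [
    "Propensity score overlap checked",
    "Covariate balance achieved (SMD < 0.1)",
    "Random common cause refutation passed",
    "Placebo treatment refutation passed",
    "Data subset refutation passed",
    "E-value reported",
    "Pre-trends test passed (if DiD)",
]

def render_checklist(status: dict) -> str:
    """Render the refutation checklist as markdown task-list items.

    Items missing from `status` default to unchecked.
    """
    lines = []
    for item in CHECKLIST:
        mark = "x" if status.get(item, False) else " "
        lines.append(f"- [{mark}] {item}")
    return "\n".join(lines)

print(render_checklist({"E-value reported": True,
                        "Data subset refutation passed": True}))
```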
Library Reference
The complete mapping from method to library:
| Method | Library | Key Class / Function | Series Part |
|---|---|---|---|
| Full pipeline (Model-Identify-Estimate-Refute) | DoWhy | CausalModel | 6 (Sec. 1) |
| DiD with TWFE | statsmodels | smf.ols() with C(unit_id) | 6 (Sec. 2) |
| PPML for counts | statsmodels | sm.GLM(family=Poisson()) | 6 (Sec. 2) |
| Propensity score estimation | scikit-learn | LogisticRegression | 6 (Sec. 3) |
| IPW | Custom / DoWhy | ipw_ate() above | 6 (Sec. 3) |
| AIPW (doubly robust) | Custom / DoWhy | aipw_ate() above | 6 (Sec. 3) |
| DML + Causal Forests | EconML | CausalForestDML | 6 (Sec. 4) |
| Meta-learners (S/T/X) | CausalML | BaseSRegressor, BaseTRegressor, BaseXRegressor | 6 (Sec. 4) |
| PC algorithm | causal-learn | pc() | 6 (Sec. 5) |
| GES | causal-learn | ges() | 6 (Sec. 5) |
| Refutation tests | DoWhy | model.refute_estimate() | 6 (Sec. 6) |
| E-value | Custom | compute_e_value() above | 6 (Sec. 6) |
Installation (one command):

```shell
pip install dowhy econml causalml causal-learn statsmodels scikit-learn
```
Summary
This post translated five posts of theory into working code. The key implementation patterns:
The DoWhy pipeline enforces discipline. Model your assumptions as a graph before estimating anything. Identify the estimand before choosing an estimator. Refute the result before reporting it. The four-step structure is not bureaucracy — it is the difference between a causal claim and a correlation dressed up in causal language.
DiD requires more than a regression coefficient. Cluster standard errors at the unit level. Run an event study to visualize pre-trends. Test parallel trends with a joint F-test. Use PPML for count outcomes.
Always check balance before trusting an IPW or matching estimate. The SMD table is the first diagnostic, not the last. AIPW provides insurance through double robustness, but neither model being correct is guaranteed — check both.
Causal forests reveal who benefits, not just whether treatment works. Feature importance for heterogeneity is a fundamentally different quantity from predictive feature importance. The former asks “which covariates modify the treatment effect?” — the latter asks “which covariates predict the outcome?”
Refutation is not optional. Every estimate in this post passed four refutation tests plus an E-value assessment. A result without refutation is a claim without evidence.
Part 7 takes the next step: can an LLM-driven agent automate this entire pipeline — from reading a dataset to proposing a causal graph, selecting an estimator, running the code, and interpreting the refutation results?
References
- Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, 119(1), 249-275.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1), C1-C68.
- Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156-4165.
- Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846-866.
- Santos Silva, J. M. C., & Tenreyro, S. (2006). The log of gravity. Review of Economics and Statistics, 88(4), 641-658.
- Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216.
- Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). MIT Press.
- VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.
- Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.