Inverse probability weighting (IPW) cannot be used naively with Bayesian inference because weights are not part of the likelihood. The Liao-Zigler (2020) method resolves this by treating propensity scores as a parameter ν, drawing K samples of propensity scores from the posterior of a Bayesian treatment model, computing weights for each draw, running the outcome model K times, and combining results with Rubin’s rules. This propagates treatment-model uncertainty into the ATE estimate.
Overview
The problem: We want to use both Bayesian inference (posterior distributions, no null hypotheses) and inverse probability weighting (to close DAG backdoor paths). These seem incompatible.
Why IPW doesn’t fit Bayes: The ATE estimand is:
ΔATE=E[E(Yi∣Ti=1,Xi)−E(Yi∣Ti=0,Xi)]
To estimate this Bayesianly, we compute P[Δ∣(T,X,Y)]∝P[(T,X,Y)∣Δ]×P[Δ]. IPTWs have no place in this equation — weights are not part of the data-generating likelihood.
The solution (Liao & Zigler 2020): Marginalize over the posterior distribution of propensity scores.
Standard Frequentist IPW
Inverse Probability of Treatment Weights (IPTW)
For a binary treatment T∈{0,1}, given propensity score e^(X)=P(T=1∣X):
IPTWi=e^(Xi)Ti+1−e^(Xi)1−Ti
Using these weights in an outcome model creates pseudo-populations: treat and control groups with matched covariate distributions, as if randomized.
Frequentist IPW: Mosquito Net and Malaria Risk
Setup: Observational data on mosquito net use (T) and malaria risk (Y, 0-100). Confounders: income, health, temperature. True ATE = −10 (nets reduce malaria risk by 10 points).
library(tidyverse); library(brms); library(broom)# Step 1: Treatment model — predict net use from confoundersmodel_treatment_freq <- glm(net ~ income + temperature + health, data = nets, family = binomial())# Steps 2-3: Propensity scores → IPTWnets_with_weights <- augment(model_treatment_freq, nets, type.predict = "response") %>% rename(propensity = .fitted) %>% mutate(iptw = (net_num / propensity) + ((1 - net_num) / (1 - propensity)))# Step 4: Outcome model weighted by IPTWmodel_outcome_freq <- lm(malaria_risk ~ net, data = nets_with_weights, weights = iptw)tidy(model_outcome_freq)# netTRUE: estimate = -10.1 ✓
The IPTW successfully recovers the −10 ATE by creating pseudo-populations of comparable treated and untreated individuals.
What Pseudo-Populations Do
Visualizing the weighted propensity score distributions shows that IPTW makes the treated and untreated groups look alike:
Treated, low propensity → high weight (surprising to be treated)
Untreated, high propensity → high weight (surprising to not be treated)
Result: both groups have similar propensity distributions → comparable
Why Bayesian IPW is Hard
The Fundamental Problem
Robins, Hernán, and Wasserman (2015) show that propensity scores cannot be incorporated into standard Bayesian inference because:
The Bayesian estimand is P[Δ∣(T,X,Y)]∝P[(T,X,Y)∣Δ]×P[Δ]
Propensity scores (and weights) have no role in the likelihood P[(T,X,Y)∣Δ]
We cannot set a prior on a weight parameter — there is no weight parameter in the model
Conclusion: “Bayesian inference must ignore the propensity score” (Robins et al. 2015)
Liao-Zigler Method: A Legal Bayesian Approach
Liao-Zigler Two-Stage Method
Instead of using a single set of weights, treat propensity scores as a parameter ν and marginalize over them:
ATE without νf(Δ∣T,X,Y)=∫νoutcome model with νf(Δ∣T,X,Y,ν)treatment modelf(ν∣T,X)dν
Practical algorithm:
Fit a Bayesian treatment model ν∣T,X to get posterior propensity scores
Draw K samples of propensity scores from the posterior
For each draw k: compute IPTW and run outcome model → ATE estimate Δ^k
Combine using Rubin’s rules: Δˉ=K1∑kΔ^k
This propagates treatment-model uncertainty into the final ATE estimate.
Bayesian IPW: Mosquito Nets (R/brms)
# Step 1: Bayesian treatment modelmodel_treatment <- brm( bf(net ~ income + temperature + health, decomp = "QR"), family = bernoulli(), # logistic regression data = nets, chains = 4, cores = 4, iter = 1000, seed = 1234, backend = "cmdstanr")# Step 2: Extract 2000 posterior draws of propensity scorespred_probs_chains <- posterior_epred(model_treatment)# dim: [2000 draws × 1752 people]# Step 3: For each of the K=2000 draws, compute IPTW and run outcome modelpred_probs_nested <- pred_probs_chains %>% as_tibble(.name_repair = "unique") %>% mutate(draw = 1:n()) %>% pivot_longer(-draw, names_to = "row", values_to = "prob") %>% mutate(row = as.numeric(str_remove(row, "..."))) %>% group_by(draw) %>% nest() %>% ungroup()outcome_models <- pred_probs_nested %>% mutate(outcome_model = map(data, ~{ df <- bind_cols(nets, .) %>% mutate(iptw = (net_num / prob) + ((1 - net_num) / (1 - prob))) lm(malaria_risk ~ net, data = df, weights = iptw) })) %>% mutate( tidied = map(outcome_model, ~tidy(.)), ate = map_dbl(tidied, ~filter(., term == "netTRUE") %>% pull(estimate)), ate_se = map_dbl(tidied, ~filter(., term == "netTRUE") %>% pull(std.error)) )# Step 4: Combine with Rubin's rulesrubin_se <- function(ates, sigmas) sqrt(mean(sigmas^2) + var(ates))mean(outcome_models$ate) # ATE ≈ -10.1 (matches frequentist)rubin_se(outcome_models$ate, outcome_models$ate_se) # SE ≈ 1.02 (larger!)
Key insight: The combined SE (1.02) is larger than the naive mean SE (0.659) because Rubin’s rules add the variance of the ATEs across draws — this properly accounts for treatment-model uncertainty.
Rubin’s Rules for Combining Results
Rubin's Rules
When combining results from K models fitted to slightly different datasets (or draws), the combined ATE and standard error are:
The between-draw variance captures the uncertainty from the treatment model. Simply averaging SEs is incorrect; Rubin’s rules give the proper pooled SE.
Limitations and Open Questions
Is the result truly Bayesian? The outcome model here is frequentist (OLS). The distribution of ATEs is a mathematical transformation of the treatment model’s posterior, but the uncertainty in the outcome model is not quantified Bayesianly. Strictly speaking, this is only quasi-Bayesian.
Full Bayesian outcome model: Running brm() 2,000 times is computationally prohibitive. A fully Bayesian solution would use one brm() run where the weights change per iteration (possible via custom brms Stan code, as noted by Jordan Nafa in the follow-up post).
Efficiency: This approach is equivalent to bootstrapping the treatment uncertainty. The interpretation of the resulting “posterior-like” distribution as a credible interval is debatable.
Comparison: Approaches to Bayesian Causal Inference
Differences-in-Differences — Another quasi-experimental approach; IPW adjusts for observed confounders while DiD adjusts for time-invariant unobservables