Bayesian Inverse Probability Weighting

Summary

Inverse probability weighting (IPW) cannot be used naively with Bayesian inference because weights are not part of the likelihood. The Liao-Zigler (2020) method resolves this by treating propensity scores as a parameter , drawing samples of propensity scores from the posterior of a Bayesian treatment model, computing weights for each draw, running the outcome model times, and combining results with Rubin’s rules. This propagates treatment-model uncertainty into the ATE estimate.

Overview

The problem: We want to use both Bayesian inference (posterior distributions, no null hypotheses) and inverse probability weighting (to close DAG backdoor paths). These seem incompatible.

Why IPW doesn’t fit Bayes: The ATE estimand is:

To estimate this Bayesianly, we compute . IPTWs have no place in this equation — weights are not part of the data-generating likelihood.

The solution (Liao & Zigler 2020): Marginalize over the posterior distribution of propensity scores.

Standard Frequentist IPW

Inverse Probability of Treatment Weights (IPTW)

For a binary treatment , given propensity score :

Using these weights in an outcome model creates pseudo-populations: treat and control groups with matched covariate distributions, as if randomized.

Frequentist IPW: Mosquito Net and Malaria Risk

Setup: Observational data on mosquito net use () and malaria risk (, 0-100). Confounders: income, health, temperature. True ATE = −10 (nets reduce malaria risk by 10 points).

library(tidyverse); library(brms); library(broom)
 
# Step 1: Treatment model — predict net use from confounders
model_treatment_freq <- glm(net ~ income + temperature + health,
                            data = nets, family = binomial())
 
# Steps 2-3: Propensity scores → IPTW
nets_with_weights <- augment(model_treatment_freq, nets,
                             type.predict = "response") %>%
  rename(propensity = .fitted) %>%
  mutate(iptw = (net_num / propensity) + ((1 - net_num) / (1 - propensity)))
 
# Step 4: Outcome model weighted by IPTW
model_outcome_freq <- lm(malaria_risk ~ net,
                         data = nets_with_weights, weights = iptw)
tidy(model_outcome_freq)
# netTRUE: estimate = -10.1 ✓

The IPTW successfully recovers the −10 ATE by creating pseudo-populations of comparable treated and untreated individuals.

What Pseudo-Populations Do

Visualizing the weighted propensity score distributions shows that IPTW makes the treated and untreated groups look alike:

  • Treated, low propensity → high weight (surprising to be treated)
  • Untreated, high propensity → high weight (surprising to not be treated)
  • Result: both groups have similar propensity distributions → comparable

Why Bayesian IPW is Hard

The Fundamental Problem

Robins, Hernán, and Wasserman (2015) show that propensity scores cannot be incorporated into standard Bayesian inference because:

  1. The Bayesian estimand is
  2. Propensity scores (and weights) have no role in the likelihood
  3. We cannot set a prior on a weight parameter — there is no weight parameter in the model

Conclusion: “Bayesian inference must ignore the propensity score” (Robins et al. 2015)

Liao-Zigler Two-Stage Method

Instead of using a single set of weights, treat propensity scores as a parameter and marginalize over them:

Practical algorithm:

  1. Fit a Bayesian treatment model to get posterior propensity scores
  2. Draw samples of propensity scores from the posterior
  3. For each draw : compute IPTW and run outcome model → ATE estimate
  4. Combine using Rubin’s rules:

This propagates treatment-model uncertainty into the final ATE estimate.

Bayesian IPW: Mosquito Nets (R/brms)

# Step 1: Bayesian treatment model
model_treatment <- brm(
  bf(net ~ income + temperature + health, decomp = "QR"),
  family = bernoulli(),  # logistic regression
  data = nets,
  chains = 4, cores = 4, iter = 1000,
  seed = 1234, backend = "cmdstanr"
)
 
# Step 2: Extract 2000 posterior draws of propensity scores
pred_probs_chains <- posterior_epred(model_treatment)
# dim: [2000 draws × 1752 people]
 
# Step 3: For each of the K=2000 draws, compute IPTW and run outcome model
pred_probs_nested <- pred_probs_chains %>%
  as_tibble(.name_repair = "unique") %>%
  mutate(draw = 1:n()) %>%
  pivot_longer(-draw, names_to = "row", values_to = "prob") %>%
  mutate(row = as.numeric(str_remove(row, "..."))) %>%
  group_by(draw) %>%
  nest() %>% ungroup()
 
outcome_models <- pred_probs_nested %>%
  mutate(outcome_model = map(data, ~{
    df <- bind_cols(nets, .) %>%
      mutate(iptw = (net_num / prob) + ((1 - net_num) / (1 - prob)))
    lm(malaria_risk ~ net, data = df, weights = iptw)
  })) %>%
  mutate(
    tidied  = map(outcome_model, ~tidy(.)),
    ate     = map_dbl(tidied, ~filter(., term == "netTRUE") %>% pull(estimate)),
    ate_se  = map_dbl(tidied, ~filter(., term == "netTRUE") %>% pull(std.error))
  )
 
# Step 4: Combine with Rubin's rules
rubin_se <- function(ates, sigmas) sqrt(mean(sigmas^2) + var(ates))
 
mean(outcome_models$ate)            # ATE ≈ -10.1 (matches frequentist)
rubin_se(outcome_models$ate, outcome_models$ate_se)  # SE ≈ 1.02 (larger!)

Key insight: The combined SE (1.02) is larger than the naive mean SE (0.659) because Rubin’s rules add the variance of the ATEs across draws — this properly accounts for treatment-model uncertainty.

Rubin’s Rules for Combining Results

Rubin's Rules

When combining results from models fitted to slightly different datasets (or draws), the combined ATE and standard error are:

The between-draw variance captures the uncertainty from the treatment model. Simply averaging SEs is incorrect; Rubin’s rules give the proper pooled SE.

Limitations and Open Questions

  1. Is the result truly Bayesian? The outcome model here is frequentist (OLS). The distribution of ATEs is a mathematical transformation of the treatment model’s posterior, but the uncertainty in the outcome model is not quantified Bayesianly. Strictly speaking, this is only quasi-Bayesian.

  2. Full Bayesian outcome model: Running brm() 2,000 times is computationally prohibitive. A fully Bayesian solution would use one brm() run where the weights change per iteration (possible via custom brms Stan code, as noted by Jordan Nafa in the follow-up post).

  3. Efficiency: This approach is equivalent to bootstrapping the treatment uncertainty. The interpretation of the resulting “posterior-like” distribution as a credible interval is debatable.

Comparison: Approaches to Bayesian Causal Inference

MethodTreatment of ConfoundersFully Bayesian?
Standard IPW (frequentist)Single propensity scoreNo
Liao-ZiglerPosterior propensity scores, Rubin’s rulesQuasi-Bayesian
BART + propensity scoresBART model; see Nonparametric Causal InferenceYes
Structural/outcome modelDirect Bayesian regression on outcomesYes (but requires correct model)

Connections

See Also