Bayesian Inverse Probability Weighting

Summary

Inverse probability weighting (IPW) cannot be used naively with Bayesian inference because weights are not part of the likelihood. The Liao-Zigler (2020) method resolves this by treating propensity scores as a parameter $ν$ , drawing $K$ samples of propensity scores from the posterior of a Bayesian treatment model, computing weights for each draw, running the outcome model $K$ times, and combining results with Rubin’s rules. This propagates treatment-model uncertainty into the ATE estimate.

Overview

The problem: We want to use both Bayesian inference (posterior distributions, no null hypotheses) and inverse probability weighting (to close DAG backdoor paths). These seem incompatible.

Why IPW doesn’t fit Bayes: The ATE estimand is:

Δ_{ATE} = E [E (Y_{i} ∣ T_{i} = 1, X_{i}) - E (Y_{i} ∣ T_{i} = 0, X_{i})]

To estimate this Bayesianly, we compute $P [Δ ∣ (T, X, Y)] \propto P [(T, X, Y) ∣ Δ] \times P [Δ]$ . IPTWs have no place in this equation — weights are not part of the data-generating likelihood.

The solution (Liao & Zigler 2020): Marginalize over the posterior distribution of propensity scores.

Standard Frequentist IPW

Inverse Probability of Treatment Weights (IPTW)

For a binary treatment $T \in {0, 1}$ , given propensity score $\overset{e}{^} (X) = P (T = 1 ∣ X)$ :
$IPTW_{i} = \frac{T _{i}}{e ^ ( X _{i} )} + \frac{1 - T _{i}}{1 - e ^ ( X _{i} )}$
Using these weights in an outcome model creates pseudo-populations: treat and control groups with matched covariate distributions, as if randomized.

Frequentist IPW: Mosquito Net and Malaria Risk

Setup: Observational data on mosquito net use ( $T$ ) and malaria risk ( $Y$ , 0-100). Confounders: income, health, temperature. True ATE = −10 (nets reduce malaria risk by 10 points).

library(tidyverse); library(brms); library(broom)
 
# Step 1: Treatment model — predict net use from confounders
model_treatment_freq <- glm(net ~ income + temperature + health,
                            data = nets, family = binomial())
 
# Steps 2-3: Propensity scores → IPTW
nets_with_weights <- augment(model_treatment_freq, nets,
                             type.predict = "response") %>%
  rename(propensity = .fitted) %>%
  mutate(iptw = (net_num / propensity) + ((1 - net_num) / (1 - propensity)))
 
# Step 4: Outcome model weighted by IPTW
model_outcome_freq <- lm(malaria_risk ~ net,
                         data = nets_with_weights, weights = iptw)
tidy(model_outcome_freq)
# netTRUE: estimate = -10.1 ✓

The IPTW successfully recovers the −10 ATE by creating pseudo-populations of comparable treated and untreated individuals.

What Pseudo-Populations Do

Visualizing the weighted propensity score distributions shows that IPTW makes the treated and untreated groups look alike:

Treated, low propensity → high weight (surprising to be treated)
Untreated, high propensity → high weight (surprising to not be treated)
Result: both groups have similar propensity distributions → comparable

Why Bayesian IPW is Hard

The Fundamental Problem

Robins, Hernán, and Wasserman (2015) show that propensity scores cannot be incorporated into standard Bayesian inference because:

The Bayesian estimand is $P [Δ ∣ (T, X, Y)] \propto P [(T, X, Y) ∣ Δ] \times P [Δ]$

Propensity scores (and weights) have no role in the likelihood $P [(T, X, Y) ∣ Δ]$

We cannot set a prior on a weight parameter — there is no weight parameter in the model

Conclusion: “Bayesian inference must ignore the propensity score” (Robins et al. 2015)

Liao-Zigler Method: A Legal Bayesian Approach

Liao-Zigler Two-Stage Method

Instead of using a single set of weights, treat propensity scores as a parameter $ν$ and marginalize over them:
$ATE without ν f (Δ ∣ T, X, Y) = \int_{ν} outcome model with ν f (Δ ∣ T, X, Y, ν) treatment model f (ν ∣ T, X) d ν$
Practical algorithm:

Fit a Bayesian treatment model $ν ∣ T, X$ to get posterior propensity scores

Draw $K$ samples of propensity scores from the posterior

For each draw $k$ : compute IPTW and run outcome model → ATE estimate $\hat{Δ}_{k}$

Combine using Rubin’s rules: $\overset{ˉ}{Δ} = \frac{1}{K} \sum_{k} \hat{Δ}_{k}$

This propagates treatment-model uncertainty into the final ATE estimate.

Bayesian IPW: Mosquito Nets (R/brms)

# Step 1: Bayesian treatment model
model_treatment <- brm(
  bf(net ~ income + temperature + health, decomp = "QR"),
  family = bernoulli(),  # logistic regression
  data = nets,
  chains = 4, cores = 4, iter = 1000,
  seed = 1234, backend = "cmdstanr"
)
 
# Step 2: Extract 2000 posterior draws of propensity scores
pred_probs_chains <- posterior_epred(model_treatment)
# dim: [2000 draws × 1752 people]
 
# Step 3: For each of the K=2000 draws, compute IPTW and run outcome model
pred_probs_nested <- pred_probs_chains %>%
  as_tibble(.name_repair = "unique") %>%
  mutate(draw = 1:n()) %>%
  pivot_longer(-draw, names_to = "row", values_to = "prob") %>%
  mutate(row = as.numeric(str_remove(row, "..."))) %>%
  group_by(draw) %>%
  nest() %>% ungroup()
 
outcome_models <- pred_probs_nested %>%
  mutate(outcome_model = map(data, ~{
    df <- bind_cols(nets, .) %>%
      mutate(iptw = (net_num / prob) + ((1 - net_num) / (1 - prob)))
    lm(malaria_risk ~ net, data = df, weights = iptw)
  })) %>%
  mutate(
    tidied  = map(outcome_model, ~tidy(.)),
    ate     = map_dbl(tidied, ~filter(., term == "netTRUE") %>% pull(estimate)),
    ate_se  = map_dbl(tidied, ~filter(., term == "netTRUE") %>% pull(std.error))
  )
 
# Step 4: Combine with Rubin's rules
rubin_se <- function(ates, sigmas) sqrt(mean(sigmas^2) + var(ates))
 
mean(outcome_models$ate)            # ATE ≈ -10.1 (matches frequentist)
rubin_se(outcome_models$ate, outcome_models$ate_se)  # SE ≈ 1.02 (larger!)

Key insight: The combined SE (1.02) is larger than the naive mean SE (0.659) because Rubin’s rules add the variance of the ATEs across draws — this properly accounts for treatment-model uncertainty.

Rubin’s Rules for Combining Results

Rubin's Rules

When combining results from $K$ models fitted to slightly different datasets (or draws), the combined ATE and standard error are:
$\overset{ˉ}{Δ} = \frac{1}{K} k = 1 \sum K \hat{Δ}_{k}$ $SE_{combined} = avg within-draw variance \frac{1}{K} k = 1 \sum K \overset{σ}{^}_{k}^{2} + between-draw variance Var (\hat{Δ}_{k})$
The between-draw variance captures the uncertainty from the treatment model. Simply averaging SEs is incorrect; Rubin’s rules give the proper pooled SE.

Limitations and Open Questions

Is the result truly Bayesian? The outcome model here is frequentist (OLS). The distribution of ATEs is a mathematical transformation of the treatment model’s posterior, but the uncertainty in the outcome model is not quantified Bayesianly. Strictly speaking, this is only quasi-Bayesian.
Full Bayesian outcome model: Running brm() 2,000 times is computationally prohibitive. A fully Bayesian solution would use one brm() run where the weights change per iteration (possible via custom brms Stan code, as noted by Jordan Nafa in the follow-up post).
Efficiency: This approach is equivalent to bootstrapping the treatment uncertainty. The interpretation of the resulting “posterior-like” distribution as a credible interval is debatable.

Comparison: Approaches to Bayesian Causal Inference

Method	Treatment of Confounders	Fully Bayesian?
Standard IPW (frequentist)	Single propensity score	No
Liao-Zigler	Posterior propensity scores, Rubin’s rules	Quasi-Bayesian
BART + propensity scores	BART model; see Nonparametric Causal Inference	Yes
Structural/outcome model	Direct Bayesian regression on outcomes	Yes (but requires correct model)

Connections

Nonparametric Causal Inference — BART-based ATE estimation using propensity scores; fully Bayesian approach
Directed Acyclic Graphs — DAG identifies which variables to include in the treatment model
The Selection Problem — IPW is one solution to the selection-into-treatment problem
Differences-in-Differences — Another quasi-experimental approach; IPW adjusts for observed confounders while DiD adjusts for time-invariant unobservables
Counterfactual Inference — Bayesian counterfactual prediction in time-series settings

Second Brain

Explorer

Bayesian Inverse Probability Weighting

Bayesian Inverse Probability Weighting

Overview

Standard Frequentist IPW

What Pseudo-Populations Do

Why Bayesian IPW is Hard

Liao-Zigler Method: A Legal Bayesian Approach

Rubin’s Rules for Combining Results

Limitations and Open Questions

Comparison: Approaches to Bayesian Causal Inference

Connections

See Also

Graph View

Table of Contents

Backlinks