Missing Data — Statistical Rethinking (Lecture 18)

Summary

A summary of Richard McElreath’s treatment of missing data in Statistical Rethinking, framed through the lens of causal DAGs. The key insight: whether and how to handle missingness depends on the mechanism generating the missing values, which is encoded in the DAG.

Three Mechanisms of Missingness

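The taxonomy (from Rubin 1976) determines which analyses are valid. Here R is the missingness indicator, and D_obs and D_mis are the observed and missing parts of the data:

| Mechanism | Formal Definition | Implication |
| --- | --- | --- |
| MCAR (Missing Completely At Random) | R is independent of both D_obs and D_mis | Missingness unrelated to any data; complete-case analysis unbiased |
| MAR (Missing At Random) | R depends on D_obs only | Missingness depends only on observed data; imputation conditional on the observed data is valid |
| MNAR (Missing Not At Random) | R depends on D_mis | Missingness depends on the missing values themselves; selection model needed |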
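
To make the three rows concrete, here is a minimal simulation sketch using only the Python standard library. The variables `x` and `y`, the sample size, and the logistic missingness rules are all hypothetical choices made for illustration; they are not taken from the lecture. It compares the complete-case mean of `y` with the full-data mean under each mechanism.

```python
import math
import random
from statistics import fmean

random.seed(1)
n = 20_000

# Hypothetical data: x is always observed, y is the variable that can go missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]  # population mean of y is 0


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def complete_case_mean(p_missing):
    """Mean of y over the rows that survive a per-row missingness probability."""
    kept = [yi for yi, p in zip(y, p_missing) if random.random() >= p]
    return fmean(kept)


full_mean = fmean(y)
mcar = complete_case_mean([0.5] * n)                       # ignores all data
mar = complete_case_mean([sigmoid(-2 * xi) for xi in x])   # depends on observed x
mnar = complete_case_mean([sigmoid(-2 * yi) for yi in y])  # depends on y itself

print(f"full data: {full_mean:+.3f}")
print(f"MCAR:      {mcar:+.3f}")
print(f"MAR:       {mar:+.3f}")
print(f"MNAR:      {mnar:+.3f}")
```

In this setup the MCAR complete-case mean should sit close to the full-data mean, while both the MAR and MNAR versions drift upward. The difference between the last two is that the MAR bias can be removed by conditioning on `x`, whereas the MNAR bias cannot be fixed from the observed data alone.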

DAG diagnosis

McElreath emphasises drawing a causal DAG with the missingness indicator R as a node. Paths from the other variables into R determine the mechanism. Conditioning on R (keeping only complete cases) can open collider paths that induce bias even under MAR.
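
The last point can be checked with a small simulation. This is a sketch under assumed conditions, not an example from the lecture: the DAG is X → Y → R, where R records whether `x` was measured, so missingness depends only on the fully observed outcome `y` (MAR). Because `y` is a collider between `x` and its own unexplained variation, selecting on its descendant R distorts the regression of `y` on `x`.

```python
import math
import random

random.seed(2)
n = 20_000

# Hypothetical DAG: X -> Y -> R. The true slope of y on x is 1.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]


def slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


# x is recorded with high probability when y is large, low when y is small.
observed = [random.random() < 1 / (1 + math.exp(-3 * yi)) for yi in y]
x_cc = [xi for xi, keep in zip(x, observed) if keep]
y_cc = [yi for yi, keep in zip(y, observed) if keep]

slope_full = slope(x, y)
slope_cc = slope(x_cc, y_cc)

print(f"slope, full data:      {slope_full:.3f}")
print(f"slope, complete cases: {slope_cc:.3f}")
```

The complete-case slope should come out noticeably below the full-data slope, even though the mechanism is MAR. Imputing `x` from a model that conditions on `y` is the usual remedy in this situation.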

The Bayesian Approach to Imputation

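Treat missing values as latent variables and place a prior on them, jointly with the model parameters. Writing x_mis for the missing values, x_obs for the observed ones, and θ for the parameters, the likelihood marginalises over the missing values:

$$p(y \mid x_{\text{obs}}, \theta) = \int p(y \mid x_{\text{obs}}, x_{\text{mis}}, \theta)\, p(x_{\text{mis}} \mid x_{\text{obs}}, \theta)\, dx_{\text{mis}}$$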

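In PyMC, missing entries in an observed array (NaN values or a NumPy masked array passed as observed=) are imputed automatically: PyMC treats them as latent variables drawn from that variable's distribution. A missing predictor can be handled the same way by giving the predictor its own distribution, or by writing the imputation prior explicitly, as in the example below.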

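import numpy as np
import pymc as pm
import pytensor.tensor as pt

# Example: impute a missing predictor.
# Assumes x_obs (standardised floats, NaN where missing) and y_obs already exist.
missing_idx = np.flatnonzero(np.isnan(x_obs))
n_missing = missing_idx.size

with pm.Model() as model:
    # Prior over missing values
    x_imputed = pm.Normal("x_imputed", mu=0, sigma=1,
                          shape=n_missing)
    # Splice the latent values into the NaN slots
    x = pt.set_subtensor(pt.as_tensor_variable(x_obs)[missing_idx], x_imputed)

    # Rest of model uses x (now complete); alpha and sigma priors are placeholders
    alpha = pm.Normal("alpha", 0, 1)
    beta = pm.Normal("beta", 0, 1)
    sigma = pm.Exponential("sigma", 1)
    mu = alpha + beta * x
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)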

Dog Eating Homework: Illustrative Example

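McElreath uses the “dogs eat homework” metaphor: if a student’s grade predicts whether their homework is missing (the dog eats the homework of students who do poorly), then missingness is associated with the true homework quality. If grade is unobserved or left out of the analysis, this behaves like MNAR with respect to homework quality, and complete-case analysis would overestimate average quality. If grade is observed and conditioned on, the mechanism is MAR.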

The DAG makes this explicit:

Grade → R (missing indicator) → [missing homework]
Grade → Homework quality

Conditioning on observed homework (that is, on R, keeping only complete cases) blocks the path through R but induces selection bias on Grade.
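
A minimal sketch of this DAG, using only the standard library, shows both the selection bias and one way to undo it when Grade is observed for every student. The binary grade, the effect size of 2.0, and the eating probabilities are hypothetical numbers chosen for illustration.

```python
import random
from statistics import fmean

random.seed(3)
n = 20_000

# Hypothetical encoding of the DAG: Grade -> Homework quality, Grade -> R.
grade = [random.choice([0, 1]) for _ in range(n)]        # 0 = poor, 1 = good
quality = [2.0 * g + random.gauss(0, 1) for g in grade]  # true homework quality
p_eaten = {0: 0.8, 1: 0.1}                               # dog prefers poor students
eaten = [random.random() < p_eaten[g] for g in grade]

true_mean = fmean(quality)

# Complete cases: only homework the dog did not eat.
cc_mean = fmean([q for q, e in zip(quality, eaten) if not e])

# Adjustment: average within each grade stratum, then weight the strata
# by their share in the full sample (grade is observed for everyone).
adjusted = 0.0
for g in (0, 1):
    stratum = [q for q, gi, e in zip(quality, grade, eaten) if gi == g and not e]
    adjusted += (grade.count(g) / n) * fmean(stratum)

print(f"true mean:          {true_mean:.3f}")
print(f"complete-case mean: {cc_mean:.3f}")
print(f"grade-adjusted:     {adjusted:.3f}")
```

The complete-case mean should overstate average quality because good students are over-represented among surviving homework, while the grade-adjusted estimate should land close to the true mean. The adjustment only works because, in this sketch, missingness depends on Grade alone and Grade is fully observed.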

Key Takeaways

  1. Always draw the DAG with the missingness indicator R included before choosing a missing data strategy
  2. MCAR is rare in practice; assuming it when MAR/MNAR holds introduces bias
  3. Bayesian imputation is coherent: uncertainty in imputed values propagates through to parameter posteriors
  4. MNAR requires modelling the missingness mechanism explicitly (a selection model)

Connections

Source