Spurious Association and Confounds

Summary

Chapter 5 of Statistical Rethinking covers multivariate regression: using multiple predictors to distinguish genuine causal effects from spurious correlations. It highlights three key phenomena: spurious association (a confound makes causally unrelated variables appear correlated), masked relationships (correlated predictors hide each other's real effects), and post-treatment bias (controlling for consequences of the treatment).

The Waffle House Example

Waffle House density correlates with divorce rate across U.S. states, but the association is spurious: both variables are associated with being a Southern state. The chapter's regression example makes the same point with marriage rate: once median age at marriage is in the model, the association between marriage rate and divorce rate largely vanishes.

Three Reasons for Multiple Regression

  1. Control for confounds — reveal that an association is spurious, or unmask a hidden one
  2. Multiple causation — estimate independent contributions of multiple causes
  3. Interactions — the importance of one variable may depend on another

Spurious Association

When a confound Z causes both predictor X and outcome Y, a bivariate regression of Y on X will show a “significant” effect that disappears when Z is included.
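The pattern can be checked with a small simulation. The sketch below is illustrative only: the variable names, effect sizes, and sample size are assumptions, not values from the book. It uses residualization (regress both the outcome and the predictor on the confound, then regress residual on residual) to obtain the coefficient that a two-predictor regression would give.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

random.seed(1)
n = 5000                                        # assumed sample size
z = [random.gauss(0, 1) for _ in range(n)]      # confound
x = [zi + random.gauss(0, 1) for zi in z]       # predictor: caused by z only
y = [zi + random.gauss(0, 1) for zi in z]       # outcome: caused by z only, not by x

bivariate = slope(x, y)                         # y ~ x
# Coefficient on x in y ~ x + z, obtained by residualizing both on z
adjusted = slope(residuals(z, x), residuals(z, y))

print(f"y ~ x      slope on x: {bivariate:.2f}")
print(f"y ~ x + z  slope on x: {adjusted:.2f}")
```

Under these assumptions the bivariate slope should land near 0.5 (a covariance of 1 divided by a predictor variance of 2), while the adjusted slope should sit near zero.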

Masked Relationships

When two predictors have opposing effects and are correlated, each can mask the other’s true effect. Only by including both predictors do the true effects emerge.
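A minimal simulation of masking, under assumed values: two predictors with correlation 0.9 and true effects of +1 and -1 on the outcome. None of these numbers come from the book; they are chosen so that each bivariate slope is close to zero while the joint regression recovers both effects.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def partial_slope(x, other, y):
    """Coefficient on x in y ~ x + other, via residualization."""
    return slope(residuals(other, x), residuals(other, y))

random.seed(2)
n, rho = 5000, 0.9                               # assumed sample size and correlation
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [rho * a + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1) for a in x1]
y = [a - b + random.gauss(0, 1) for a, b in zip(x1, x2)]   # effects: +1 and -1

b1_alone = slope(x1, y)                          # y ~ x1
b2_alone = slope(x2, y)                          # y ~ x2
b1_joint = partial_slope(x1, x2, y)              # y ~ x1 + x2, coefficient on x1
b2_joint = partial_slope(x2, x1, y)              # y ~ x1 + x2, coefficient on x2

print(f"alone: b1={b1_alone:.2f}, b2={b2_alone:.2f}")
print(f"joint: b1={b1_joint:.2f}, b2={b2_joint:.2f}")
```

Each predictor looks nearly useless on its own because the other one, moving with it, cancels most of its effect. Only the joint fit separates them.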

When Adding Variables Hurts

Post-treatment bias

Never Control for Post-Treatment Variables

If treatment → mediator → outcome, controlling for the mediator blocks the treatment’s indirect effect. Example: soil treatment → fungus reduction → plant growth. Controlling for fungus makes treatment appear ineffective.

This connects directly to the econometric concept of bad controls — variables affected by treatment should not be included as controls.
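The soil example can be sketched as a simulation. The probabilities and effect sizes below are assumptions chosen for illustration: treatment lowers the chance of fungus, and fungus alone reduces growth, so the treatment works entirely through the mediator.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def partial_slope(x, other, y):
    """Coefficient on x in y ~ x + other, via residualization."""
    return slope(residuals(other, x), residuals(other, y))

random.seed(3)
n = 5000                                             # assumed number of plants
treatment = [i % 2 for i in range(n)]                # half the plants are treated
# Assumed fungus probabilities: 0.5 if untreated, 0.1 if treated
fungus = [1 if random.random() < (0.1 if t else 0.5) else 0 for t in treatment]
# Growth depends on fungus only; treatment has no direct path to growth
growth = [5 - 3 * f + random.gauss(0, 1) for f in fungus]

total_effect = slope(treatment, growth)              # growth ~ treatment
adjusted = partial_slope(treatment, fungus, growth)  # growth ~ treatment + fungus

print(f"growth ~ treatment:          {total_effect:.2f}")
print(f"growth ~ treatment + fungus: {adjusted:.2f}")
```

With these numbers the total effect should be close to 1.2 (a 0.4 drop in fungus probability times a growth penalty of 3). Once fungus is in the model the treatment coefficient collapses toward zero, even though the treatment is what reduced the fungus.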

Multicollinearity

When two predictors are highly correlated, their individual effects become unidentifiable — the posterior for each is wide even though they jointly predict well.
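One way to see this without fitting a Bayesian model is to refit the same regression on many simulated datasets and compare how much the individual coefficients move against how much their sum moves. This is a sampling-variability stand-in for posterior width, not a posterior itself, and every number in it is an assumption.

```python
import random
from statistics import stdev

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def partial_slope(x, other, y):
    """Coefficient on x in y ~ x + other, via residualization."""
    return slope(residuals(other, x), residuals(other, y))

def simulate(n=100):
    """Fit y ~ x1 + x2 on one simulated dataset where x2 is almost a copy of x1."""
    x1 = [random.gauss(0, 1) for _ in range(n)]
    x2 = [a + random.gauss(0, 0.1) for a in x1]
    y = [a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]   # true effects: 1 and 1
    return partial_slope(x1, x2, y), partial_slope(x2, x1, y)

random.seed(5)
fits = [simulate() for _ in range(200)]
sd_b1 = stdev([b1 for b1, _ in fits])           # spread of one coefficient
sd_sum = stdev([b1 + b2 for b1, b2 in fits])    # spread of the combined effect

print(f"sd of b1:      {sd_b1:.2f}")
print(f"sd of b1 + b2: {sd_sum:.2f}")
```

The individual coefficient swings widely from dataset to dataset while the sum stays tightly pinned, which mirrors the statement above: the data constrain the combination well but say little about how to split it.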

Categorical Variables

  • Binary: use a single dummy variable (k − 1 dummies for k categories)
  • Index variable: assign each category an integer index and estimate a vector of intercepts — more natural for multilevel models
  • Contrasts: compute posterior distributions of differences between categories from samples
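The index-variable and contrast ideas can be sketched together. The example below is hypothetical throughout: the category names, true means, and sample sizes are made up, and a normal approximation with a flat prior stands in for a real posterior sampler.

```python
import random
from statistics import mean, stdev

random.seed(4)
# Hypothetical categories and true means, used only to simulate data
true_means = {"A": 0.0, "B": 0.5, "C": 2.0}
labels = list(true_means)
index = {name: i for i, name in enumerate(labels)}   # index coding: A->0, B->1, C->2

# Each observation is (category index, outcome)
data = [(index[name], random.gauss(mu, 1.0))
        for name, mu in true_means.items() for _ in range(200)]
groups = [[y for i, y in data if i == k] for k in range(len(labels))]

def approx_posterior(ys, draws=4000):
    """Normal approximation to the posterior of one category's mean (flat prior assumed)."""
    m, se = mean(ys), stdev(ys) / len(ys) ** 0.5
    return [random.gauss(m, se) for _ in range(draws)]

# alpha[k] holds samples for the intercept of category k
alpha = [approx_posterior(g) for g in groups]

# Contrast: subtract sample by sample, then summarize the resulting distribution
diff_CA = [c - a for c, a in zip(alpha[index["C"]], alpha[index["A"]])]
ordered = sorted(diff_CA)
lo = ordered[int(0.055 * len(ordered))]   # bounds of a central 89% interval
hi = ordered[int(0.945 * len(ordered))]

print(f"C - A: mean {mean(diff_CA):.2f}, 89% interval [{lo:.2f}, {hi:.2f}]")
```

The point of the design is that the difference is a derived quantity with its own distribution. It is summarized after subtracting samples, not by comparing two separate intervals.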

Don't Compare Marginal Significance

Even if one coefficient is “significant” and another is not, the difference between them may not be significant. Always compute the contrast directly.
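A short arithmetic illustration, using hypothetical estimates and assuming the two estimates are independent and approximately normal. In a Bayesian workflow the same check would be done by subtracting posterior samples; the z-scores here are only a quick stand-in.

```python
from math import sqrt

# Hypothetical estimates as (mean, standard error); independence is assumed
a_est, a_se = 25.0, 10.0
b_est, b_se = 10.0, 10.0

def z_score(est, se):
    """Estimate divided by its standard error."""
    return est / se

# The contrast has its own uncertainty: variances add for independent estimates
diff_est = a_est - b_est
diff_se = sqrt(a_se ** 2 + b_se ** 2)

z_a = z_score(a_est, a_se)
z_b = z_score(b_est, b_se)
z_diff = z_score(diff_est, diff_se)

print(f"A:     z = {z_a:.2f}")      # 2.50, beyond the usual 1.96 cutoff
print(f"B:     z = {z_b:.2f}")      # 1.00, inside the cutoff
print(f"A - B: z = {z_diff:.2f}")   # 1.06, the difference itself is not clear
```

One estimate clears the conventional threshold and the other does not, yet the difference between them is about one standard error from zero.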

See Also