Spurious Association and Confounds
Summary
Chapter 5 of Statistical Rethinking covers multivariate regression — using multiple predictors to distinguish genuine causal effects from spurious correlations. Three key phenomena: spurious association (confounds make unrelated variables appear correlated), masked relationships (confounds hide real effects), and post-treatment bias (controlling for consequences of treatment).
The Waffle House Example
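Waffle House density correlates with divorce rate across U.S. states, but the association is spurious: both variables are associated with being a Southern state. The chapter's worked example makes the same point with marriage rate. Marriage rate predicts divorce rate on its own, but once median age at marriage is included in a multiple regression, the marriage rate association largely vanishes.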
Three Reasons for Multiple Regression
- Control for confounds — reveal that an association is spurious, or unmask a hidden one
- Multiple causation — estimate independent contributions of multiple causes
- Interactions — the importance of one variable may depend on another
Spurious Association
When a confound Z causes both predictor X and outcome Y, a bivariate regression of Y on X will show a “significant” effect that disappears when Z is included.
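A small simulation can make this concrete. The sketch below is illustrative only: the data-generating process, the variable names `x`, `y`, `z`, and the helper `ols` are assumptions, and ordinary least squares stands in for the Bayesian models the book fits.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Assumed data-generating process: the confound z drives both x and y,
# and x has no effect of its own on y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)


def ols(outcome, *predictors):
    """Least-squares coefficients: [intercept, slope_1, slope_2, ...]."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


slope_bivariate = ols(y, x)[1]    # slope on x when z is omitted
slope_adjusted = ols(y, x, z)[1]  # slope on x once z is included

print(f"y ~ x:     slope on x = {slope_bivariate:.2f}")
print(f"y ~ x + z: slope on x = {slope_adjusted:.2f}")
```

Under these assumptions the bivariate slope should land near 0.5, since the covariance of x and y is 1 and the variance of x is 2, while the adjusted slope should sit near zero.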
Masked Relationships
When two predictors are correlated with each other but have opposing effects on the outcome, each can mask the other’s effect in a bivariate regression. The true effects emerge only when both predictors are included.
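The masking effect can be sketched with simulated data. Everything here is an assumption made for illustration: two predictors `x1` and `x2` share a common source `u`, and their true effects on `y` are set to +1 and -1. Least squares is used as a stand-in for a full posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Assumed setup: x1 and x2 are positively correlated through u,
# and they push the outcome in opposite directions.
u = rng.normal(size=n)
x1 = u + 0.5 * rng.normal(size=n)
x2 = u + 0.5 * rng.normal(size=n)
y = 1.0 * x1 - 1.0 * x2 + rng.normal(size=n)


def ols(outcome, *predictors):
    """Least-squares coefficients: [intercept, slope_1, slope_2, ...]."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


b1_alone = ols(y, x1)[1]
b2_alone = ols(y, x2)[1]
_, b1_joint, b2_joint = ols(y, x1, x2)

print(f"x1 alone: {b1_alone:+.2f}   x1 with x2: {b1_joint:+.2f}")
print(f"x2 alone: {b2_alone:+.2f}   x2 with x1: {b2_joint:+.2f}")
```

With these settings each bivariate slope should shrink to roughly a fifth of its true size, and the joint model should recover values close to +1 and -1.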
When Adding Variables Hurts
Post-Treatment Bias
Never Control for Post-Treatment Variables
If treatment → mediator → outcome, controlling for the mediator blocks the treatment’s indirect effect. Example: soil treatment → fungus reduction → plant growth. Controlling for fungus makes treatment appear ineffective.
This connects directly to the econometric concept of bad controls — variables affected by treatment should not be included as controls.
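The soil and fungus example can be simulated to show the bias. This is a hedged sketch: the probabilities, effect sizes, and the helper `ols` are assumed for illustration and are not taken from the book, and least squares replaces the Bayesian fit. Treatment is assumed to work only by reducing fungus.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Assumed causal chain: treatment -> fungus -> growth.
treatment = np.repeat([0, 1], n // 2)
# Treatment lowers the chance of fungus from 0.5 to 0.1 (assumed values).
fungus = rng.binomial(1, 0.5 - 0.4 * treatment)
# Fungus cuts growth by 3 units; treatment has no direct effect.
growth = rng.normal(5.0 - 3.0 * fungus, 1.0)


def ols(outcome, *predictors):
    """Least-squares coefficients: [intercept, slope_1, slope_2, ...]."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


total_effect = ols(growth, treatment)[1]             # fungus left out
adjusted_effect = ols(growth, treatment, fungus)[1]  # fungus controlled

print(f"growth ~ treatment:          {total_effect:.2f}")
print(f"growth ~ treatment + fungus: {adjusted_effect:.2f}")
```

Under these assumptions the first model should recover a total effect near 1.2 (a 0.4 drop in fungus probability times 3 units of growth), while the second should report a treatment coefficient near zero, because fungus already carries all of the treatment's influence.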
Multicollinearity
When two predictors are highly correlated, their individual effects are hard to estimate separately: the posterior for each coefficient is wide even though the two predictors jointly predict the outcome well.
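A rough numerical illustration follows, with least-squares standard errors standing in for posterior widths (an approximation that assumes flat priors). The nearly duplicated predictors `x1` and `x2` and all numeric settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Assumed setup: x2 is almost a copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)

design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Classical covariance of the estimates: sigma^2 * (X'X)^-1.
resid = y - design @ coef
sigma2 = resid @ resid / (n - design.shape[1])
cov = sigma2 * np.linalg.inv(design.T @ design)

se_x1 = np.sqrt(cov[1, 1])
se_x2 = np.sqrt(cov[2, 2])
# Standard error of the sum of the two slopes.
se_sum = np.sqrt(cov[1, 1] + cov[2, 2] + 2 * cov[1, 2])
r_squared = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(f"se(b1) = {se_x1:.2f}, se(b2) = {se_x2:.2f}")
print(f"se(b1 + b2) = {se_sum:.2f}, R^2 = {r_squared:.2f}")
```

The individual standard errors should come out many times larger than the standard error of the sum: the data pin down the combined effect well but say little about how to split it between the two predictors.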
Categorical Variables
- Binary: use a single dummy variable (k − 1 dummies for k categories)
- Index variable: assign each category an integer index and estimate a vector of intercepts — more natural for multilevel models
- Contrasts: compute posterior distributions of differences between categories from samples
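The index-variable and contrast ideas above can be sketched as follows. This is a simplified illustration, not the book's code: the three categories and their means are made up, and each intercept's posterior is approximated as Normal(group mean, standard error), which assumes flat priors. The 89% interval is only a convention choice.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: three categories coded as integer indices 0, 1, 2.
true_means = np.array([1.0, 1.5, 3.0])
idx = rng.integers(0, 3, size=600)
y = rng.normal(true_means[idx], 1.0)

# One intercept per category, stored as a vector and addressed by index.
n_draws = 4_000
draws = np.empty((n_draws, 3))
for k in range(3):
    group = y[idx == k]
    se = group.std(ddof=1) / np.sqrt(len(group))
    # Approximate posterior draws for intercept k (flat-prior assumption).
    draws[:, k] = rng.normal(group.mean(), se, size=n_draws)

# Contrast between categories 2 and 0, computed draw by draw.
contrast = draws[:, 2] - draws[:, 0]
lower, upper = np.percentile(contrast, [5.5, 94.5])

print(f"contrast mean = {contrast.mean():.2f}")
print(f"89% interval  = [{lower:.2f}, {upper:.2f}]")
```

Because the contrast is computed on each draw, its interval reflects the uncertainty in both intercepts at once, which comparing the two marginal intervals by eye does not.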
Don't Compare Marginal Significance
Even if one coefficient is “significant” and another is not, the difference between the two may itself not be significant. Always compute the contrast directly.
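A short arithmetic example shows how this can happen. The estimates and standard errors are hypothetical, the two estimates are assumed independent, and the 1.96 cutoff is the usual frequentist convention rather than anything specific to the book.

```python
import math

# Hypothetical estimates and standard errors, assumed independent.
est_a, se_a = 0.25, 0.10
est_b, se_b = 0.10, 0.10

z_a = est_a / se_a  # clears the 1.96 cutoff
z_b = est_b / se_b  # does not clear it

# The contrast has its own, larger standard error.
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)
z_diff = diff / se_diff

print(f"z_a = {z_a:.2f}, z_b = {z_b:.2f}, z_diff = {z_diff:.2f}")
```

Here the first estimate is about 2.5 standard errors from zero and the second about 1, yet their difference is only about 1.06 standard errors from zero, so the data do not clearly distinguish the two.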
See Also
- Conditional Independence Assumption — the econometric parallel to controlling for confounds
- Omitted Variables Bias — what happens when you don’t control for confounds
- The Selection Problem — the fundamental challenge these methods address
- Moderation Analysis — Ch 6 of Statistical Rethinking, the natural next step: when interaction terms are needed alongside confound control
- Counterfactual Inference — explicit counterfactual framing of what it means for a regression to “control for” a variable
- Linear Models in Statistical Rethinking — Ch 4, the single-predictor foundation
- Bayesian Linear Regression — BDA3’s formal treatment
- Statistical Rethinking - Overview
- Data Collection Models — ignorability is the formal condition under which controlling for confounds gives a causal interpretation
- Directed Acyclic Graphs — DAG framework for identifying forks, pipes, and colliders that this chapter reasons about informally
- Regression and the CEF — MHE’s econometric treatment of multivariate regression as CEF approximation, the frequentist parallel to Statistical Rethinking’s confound analysis