Spurious Association and Confounds

Summary

Chapter 5 of Statistical Rethinking covers multiple regression: using several predictors at once to distinguish genuine causal effects from spurious correlations. Three key phenomena are spurious association (a confound makes causally unrelated variables appear related), masked relationships (correlated predictors with opposing effects hide each other's real effects), and post-treatment bias (controlling for a consequence of the treatment).

The Waffle House Example

Waffle House density correlates with divorce rate across U.S. states, but the association is spurious: Waffle Houses are concentrated in the South, and Southern states also tend to have higher divorce rates. The chapter's worked example makes the same point with marriage rate, whose association with divorce rate largely vanishes once median age at marriage is included in the regression.

Three Reasons for Multiple Regression

Control for confounds — reveal that an association is spurious, or unmask a hidden one
Multiple causation — estimate independent contributions of multiple causes
Interactions — the importance of one variable may depend on another

Spurious Association

When a confound $C$ causes both predictor $X$ and outcome $Y$, a bivariate regression of $Y$ on $X$ will show a “significant” effect that disappears when $C$ is included.
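A minimal simulation sketch of this pattern. The data-generating process, the sample size, and the helper `ols` are assumptions made for illustration, and ordinary least squares stands in for the Bayesian models used in the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Assumed data-generating process: C causes both X and Y; X has no effect on Y.
C = rng.normal(size=n)
X = C + rng.normal(size=n)
Y = C + rng.normal(size=n)


def ols(y, *predictors):
    """Least-squares coefficients (intercept first) of y on the given predictors."""
    design = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


slope_bivariate = ols(Y, X)[1]    # Y ~ X: picks up the path through C
slope_adjusted = ols(Y, X, C)[1]  # Y ~ X + C: the spurious slope shrinks toward 0

print(f"slope of X, bivariate:     {slope_bivariate:.3f}")
print(f"slope of X, adjusted on C: {slope_adjusted:.3f}")
```

Under these assumptions the bivariate slope is about 0.5 in expectation (the covariance of $X$ and $Y$ divided by the variance of $X$), while the adjusted slope is close to zero.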

Masked Relationships

When two predictors are correlated with each other and have opposing effects on the outcome, each can mask the other’s true effect, so both look weak in separate bivariate regressions. The true effects emerge only when both predictors are included.
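Masking can be sketched with simulated data. The correlation of 0.9, the effect sizes of +1 and -1, and the helper `ols` are all assumptions chosen for illustration. Least squares is used here in place of a full posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Assumed setup: X1 and X2 are positively correlated (about 0.9),
# X1 raises Y and X2 lowers it by the same amount.
X1 = rng.normal(size=n)
X2 = 0.9 * X1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
Y = 1.0 * X1 - 1.0 * X2 + rng.normal(size=n)


def ols(y, *predictors):
    """Least-squares coefficients (intercept first) of y on the given predictors."""
    design = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


b1_alone = ols(Y, X1)[1]               # Y ~ X1: effect mostly cancelled by X2
b2_alone = ols(Y, X2)[1]               # Y ~ X2: effect mostly cancelled by X1
_, b1_joint, b2_joint = ols(Y, X1, X2)  # Y ~ X1 + X2: both effects recovered

print(f"alone: b1={b1_alone:.2f}, b2={b2_alone:.2f}")
print(f"joint: b1={b1_joint:.2f}, b2={b2_joint:.2f}")
```

Each predictor alone shows a slope near zero, while the joint model recovers slopes near +1 and -1.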

When Adding Variables Hurts

Post-treatment bias

Never Control for Post-Treatment Variables

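If treatment → mediator → outcome, controlling for the mediator blocks the effect that flows through it. Example: soil treatment → fungus reduction → plant growth. If the treatment works only by reducing fungus, controlling for fungus makes the treatment appear ineffective, because fungus status already carries everything the treatment tells us about growth.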

This connects directly to the econometric concept of bad controls — variables affected by treatment should not be included as controls.
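The fungus example can be sketched as a simulation. The probabilities, effect sizes, and variable names below are assumptions for illustration and are not the book's numbers. Least squares again stands in for the Bayesian fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Assumed process: treatment lowers the chance of fungus (0.5 -> 0.1),
# and plant growth depends only on fungus, not on treatment directly.
treatment = rng.integers(0, 2, size=n).astype(float)
fungus = rng.binomial(1, 0.5 - 0.4 * treatment).astype(float)
growth = 5.0 - 3.0 * fungus + rng.normal(size=n)


def ols(y, *predictors):
    """Least-squares coefficients (intercept first) of y on the given predictors."""
    design = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


total_effect = ols(growth, treatment)[1]             # growth ~ treatment
adjusted_effect = ols(growth, treatment, fungus)[1]  # growth ~ treatment + fungus

print(f"treatment effect, fungus omitted:  {total_effect:.2f}")
print(f"treatment effect, fungus included: {adjusted_effect:.2f}")
```

With fungus omitted, the estimated treatment effect is close to the assumed total effect of $0.4 \times 3 = 1.2$. With fungus included, it collapses toward zero even though the treatment works.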

Multicollinearity

When two predictors are highly correlated, their individual effects are hard to separate: the posterior for each coefficient is wide even though the two together predict the outcome well.
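A rough sketch of the same phenomenon, using least-squares standard errors as a stand-in for posterior width. The near-duplicate predictor `X2`, the coefficients, and the helper `ols_with_cov` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Assumed setup: X2 is almost a copy of X1, and both contribute equally to Y.
X1 = rng.normal(size=n)
X2 = X1 + 0.05 * rng.normal(size=n)
Y = X1 + X2 + rng.normal(size=n)


def ols_with_cov(y, *predictors):
    """Least-squares coefficients and their estimated covariance matrix."""
    design = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef, cov


coef, cov = ols_with_cov(Y, X1, X2)
se_b1 = np.sqrt(cov[1, 1])
se_b2 = np.sqrt(cov[2, 2])
# Uncertainty of the sum b1 + b2 accounts for the covariance between them.
se_sum = np.sqrt(cov[1, 1] + cov[2, 2] + 2 * cov[1, 2])

print(f"b1 = {coef[1]:.2f} (se {se_b1:.2f}), b2 = {coef[2]:.2f} (se {se_b2:.2f})")
print(f"b1 + b2 = {coef[1] + coef[2]:.2f} (se {se_sum:.2f})")
```

The individual coefficients are very uncertain and strongly negatively correlated, while their sum is estimated precisely. That is why the joint prediction stays good.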

Categorical Variables

Dummy variables: a binary category needs a single 0/1 indicator; in general, $k$ categories need $k - 1$ dummies, with the omitted category serving as the baseline
Index variable: assign each category an integer index and estimate a vector of intercepts — more natural for multilevel models
Contrasts: compute posterior distributions of differences between categories from samples
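The two codings above can be compared on simulated data. The three groups, their means, and the variable names are assumptions for illustration. With least squares and no priors, the two parameterisations describe the same fit.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 600

# Assumed data: three categories with different means.
group = rng.integers(0, 3, size=n)      # index variable: 0, 1, 2
true_means = np.array([1.0, 2.0, 4.0])
y = true_means[group] + rng.normal(size=n)

# Index coding: one column per category, no shared intercept.
# Each coefficient is that category's own intercept.
index_design = np.eye(3)[group]
alpha, *_ = np.linalg.lstsq(index_design, y, rcond=None)

# Dummy coding: intercept plus k - 1 indicators, category 0 as baseline.
# Coefficients are the baseline mean and differences from it.
dummy_design = np.column_stack(
    [np.ones(n), group == 1, group == 2]
).astype(float)
beta, *_ = np.linalg.lstsq(dummy_design, y, rcond=None)

print("index coding intercepts:", np.round(alpha, 2))
print("dummy coding (baseline, diff 1, diff 2):", np.round(beta, 2))
```

In a Bayesian model the two codings differ in one practical way: with dummies, the non-baseline categories inherit uncertainty from two parameters' priors, while index coding lets every category receive the same prior.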

Don't Compare Marginal Significance

Even if $\beta_{f}$ is “significant” and $\beta_{m}$ is not, the difference $\beta_{f} - \beta_{m}$ may not be significant. Always compute the contrast directly.
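A sketch of computing a contrast from samples. The draws below are hypothetical independent normal samples standing in for posterior samples of $\beta_{f}$ and $\beta_{m}$; the means and standard deviations are made up for illustration. With a real fitted model, use the joint posterior samples so that any correlation between the two parameters is preserved.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 20_000

# Hypothetical stand-ins for posterior samples of two category effects.
beta_f = rng.normal(loc=0.30, scale=0.10, size=n_samples)
beta_m = rng.normal(loc=0.15, scale=0.10, size=n_samples)

# The contrast is computed sample by sample, then summarised.
contrast = beta_f - beta_m


def interval95(samples):
    """Central 95% interval of a set of samples."""
    lo, hi = np.percentile(samples, [2.5, 97.5])
    return lo, hi


f_lo, f_hi = interval95(beta_f)
m_lo, m_hi = interval95(beta_m)
d_lo, d_hi = interval95(contrast)

print(f"beta_f:          [{f_lo:.2f}, {f_hi:.2f}]")
print(f"beta_m:          [{m_lo:.2f}, {m_hi:.2f}]")
print(f"beta_f - beta_m: [{d_lo:.2f}, {d_hi:.2f}]")
```

Under these made-up numbers the interval for `beta_f` excludes zero and the interval for `beta_m` includes it, yet the interval for the difference still includes zero.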
