Overfitting and Information Criteria
Summary
Chapter 6 of Statistical Rethinking covers the bias-variance tradeoff using the metaphors of Scylla (overfitting) and Charybdis (underfitting). It introduces information theory, KL divergence, and information criteria (AIC, DIC, WAIC) as tools for navigating this tradeoff, and it covers regularizing priors as a Bayesian alternative.
The Problem with Parameters
- $R^2$ never decreases as parameters are added; even random predictors improve in-sample fit
- More complex models overfit — they learn noise and predict new data worse
- Simpler models underfit — they miss real patterns
The brain-size example (7 hominin species): a 6th-degree polynomial has as many parameters as there are data points, so it achieves $R^2 = 1$, but it predicts negative brain volumes between data points.
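The claim that in-sample fit (measured here by $R^2$) cannot get worse as predictors are added can be checked directly. The sketch below is illustrative only: the data are synthetic, and `r_squared`, `junk`, the sample size, and the seed are choices made up for this example rather than anything taken from the book.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)      # arbitrary seed (assumption)
n = 20
x = rng.normal(size=n)              # the one real predictor
y = 2.0 * x + rng.normal(size=n)    # synthetic outcome
junk = rng.normal(size=(n, 10))     # predictors generated independently of y

# R^2 after adding 0, 1, ..., 10 junk predictors to the real one
r2_path = [r_squared(np.column_stack([x, junk[:, :k]]), y) for k in range(11)]
for k, r2 in enumerate(r2_path):
    print(f"{k:2d} junk predictors: R^2 = {r2:.3f}")
```

Because each model nests the previous one, least squares can always reproduce the earlier fit by setting the new coefficient to zero, so the printed sequence is non-decreasing. This says nothing about how the larger models predict new data.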
Two Families of Solutions
1. Regularizing Priors
Informative priors that are skeptical of extreme parameter values:
- Prevent overfitting by keeping parameters modest
- The Bayesian version of “penalized likelihood” / ridge / lasso
- See Bayesian Linear Regression for specific prior choices (horseshoe, etc.)
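The link between a skeptical prior and ridge regression can be made concrete. The sketch below assumes a linear model with known noise standard deviation, no intercept, and independent zero-mean Gaussian priors on the coefficients; under those assumptions the posterior mode equals the ridge estimate with penalty `noise_sd**2 / prior_sd**2`. The function name `map_gaussian_prior` and the synthetic data are made up for this illustration.

```python
import numpy as np

def map_gaussian_prior(X, y, prior_sd, noise_sd=1.0):
    """Posterior mode for y ~ Normal(X @ b, noise_sd) with b_j ~ Normal(0, prior_sd).

    Algebraically identical to ridge regression with lam = noise_sd**2 / prior_sd**2.
    """
    lam = noise_sd**2 / prior_sd**2
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(1)      # arbitrary seed (assumption)
n, k = 15, 8
X = rng.normal(size=(n, k))
y = X[:, 0] + rng.normal(size=n)    # only the first predictor carries signal

prior_sds = [100.0, 1.0, 0.5, 0.1]  # nearly flat -> strongly skeptical
norms = [float(np.linalg.norm(map_gaussian_prior(X, y, sd))) for sd in prior_sds]
for sd, nrm in zip(prior_sds, norms):
    print(f"prior sd = {sd:6.1f}  ->  coefficient norm = {nrm:.3f}")
```

Tightening the prior shrinks the whole coefficient vector toward zero, which is the mechanism that keeps the model from chasing noise. A Laplace prior would play the same role for the lasso, though its mode has no closed form like this one.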
2. Information Criteria
Score models on out-of-sample predictive accuracy estimated from in-sample fit.
Information Theory Foundations
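Kullback-Leibler divergence measures the distance from a model $q$ to the true distribution $p$:
$$D_{KL}(p, q) = \sum_i p_i \left( \log p_i - \log q_i \right)$$
We can't compute $D_{KL}$ directly (we don't know $p$), but the $\sum_i p_i \log p_i$ term is the same for every model and cancels in comparisons, so we can estimate differences in $D_{KL}$ between models using deviance:
$$D(q) = -2 \sum_i \log q_i$$
where $q_i$ is the likelihood the model assigns to observation $i$.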
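A small numeric check shows why only differences matter. Writing $p$ for the true distribution and $q$, $r$ for two candidate models, the gap in divergence between the models equals the gap in their expected log scores, and the unknown entropy of $p$ drops out. The distributions below are arbitrary illustrative values, the helper names are hypothetical, and the code assumes every probability is strictly positive.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p, q) = sum_i p_i * (log p_i - log q_i) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def expected_log_score(p, q):
    """E_p[log q]: average log-probability that model q gives to events drawn from p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(q)))

p = [0.3, 0.7]     # "truth" (illustrative values)
q = [0.25, 0.75]   # a model close to the truth
r = [0.01, 0.99]   # a model far from the truth

kl_pq, kl_pr = kl_divergence(p, q), kl_divergence(p, r)

# Difference in divergence vs. difference in expected log score
gap_kl = kl_pq - kl_pr
gap_score = expected_log_score(p, r) - expected_log_score(p, q)

print(f"D_KL(p, q) = {kl_pq:.4f}, D_KL(p, r) = {kl_pr:.4f}")
print(f"gap from divergences = {gap_kl:.6f}, gap from log scores = {gap_score:.6f}")
print(f"asymmetry: D_KL(p, r) = {kl_pr:.4f} vs D_KL(r, p) = {kl_divergence(r, p):.4f}")
```

The two gaps agree to floating-point precision. In practice the expectation over $p$ is replaced by a sum over observed data, which is what deviance does, and the last line is a reminder that the divergence is not symmetric, so "distance" is a loose term.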
The Criteria Zoo
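| Criterion | Formula | Assumptions |
|---|---|---|
| AIC | $D_{\text{train}} + 2k$, where $k$ is the number of parameters | Flat priors, Gaussian posterior, $N \gg k$ |
| DIC | $\bar{D} + p_D$, where $p_D = \bar{D} - \hat{D}$ | Point-estimate posterior |
| WAIC | $-2(\text{lppd} - p_{\text{WAIC}})$ | Fully Bayesian, pointwise |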
WAIC is the most general: it uses the full posterior, makes no Gaussian approximation, and computes the effective number of parameters pointwise.
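To make the three criteria concrete, here is a minimal sketch that computes each from posterior draws for a deliberately simple model: Gaussian data with known standard deviation 1, a flat prior on the mean, and therefore a posterior that is exactly Gaussian. It uses the common definitions AIC $= D_{\text{train}} + 2k$, DIC $= \bar{D} + (\bar{D} - \hat{D})$, and WAIC $= -2(\text{lppd} - p_{\text{WAIC}})$ with $p_{\text{WAIC}}$ the summed pointwise variance of the log-likelihood. All names, the seed, and the simulated data are assumptions of this example.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def waic(loglik):
    """WAIC from an (S draws x N observations) matrix of pointwise log-likelihoods."""
    lppd = np.sum(log_mean_exp(loglik, axis=0))
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))  # pointwise penalty
    return -2.0 * (lppd - p_waic), p_waic

def normal_loglik(y, mu):
    """Log-density of y under Normal(mu, 1); broadcasts over y and mu."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2

rng = np.random.default_rng(2)      # arbitrary seed (assumption)
N, S = 50, 4000
y = rng.normal(loc=1.0, scale=1.0, size=N)   # simulated data

# Flat prior + known sd = 1 gives the posterior mu ~ Normal(mean(y), 1/sqrt(N))
mu_draws = rng.normal(loc=y.mean(), scale=1 / np.sqrt(N), size=S)
loglik = normal_loglik(y[None, :], mu_draws[:, None])   # shape (S, N)

waic_value, p_waic = waic(loglik)

dev_draws = -2.0 * loglik.sum(axis=1)                   # deviance of each draw
d_bar = dev_draws.mean()                                # average deviance
d_hat = -2.0 * normal_loglik(y, mu_draws.mean()).sum()  # deviance at posterior mean
p_d = d_bar - d_hat
dic = d_bar + p_d

aic = -2.0 * normal_loglik(y, y.mean()).sum() + 2 * 1   # one free parameter

print(f"AIC  = {aic:.2f}")
print(f"DIC  = {dic:.2f}  (p_D    = {p_d:.2f})")
print(f"WAIC = {waic_value:.2f}  (p_WAIC = {p_waic:.2f})")
```

In this setting all of AIC's assumptions hold, so the three values should land close together and both effective-parameter estimates should sit near 1. The pointwise variance uses `ddof=1`; some implementations use `ddof=0`, which makes a negligible difference with thousands of draws.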
WAIC is Preferred
Among information criteria, WAIC is the most Bayesian and makes the fewest assumptions. It converges to AIC when priors are flat and the posterior is Gaussian.
See Also
- Model Comparison — BDA3’s treatment of model comparison (Ch 7)
- Model Checking — posterior predictive checks, complementary to information criteria
- Bayesian Linear Regression — regularizing priors in regression context
- Linear Models in Statistical Rethinking — the models this chapter evaluates
- Statistical Rethinking - Overview
- Hierarchical Models — partial pooling is a form of regularization that directly reduces effective model complexity
- Bayesian Workflow - Overview — information criteria (WAIC/LOO) are the quantitative tools in the iterative model comparison step
- Probability and Bayesian Inference — KL divergence and the log score are grounded in the probability theory introduced there