Overfitting and Information Criteria

Summary

Chapter 6 of Statistical Rethinking covers the bias-variance tradeoff using the metaphors of Scylla (overfitting) and Charybdis (underfitting). Introduces information theory, KL divergence, and information criteria (AIC, DIC, WAIC) as tools for navigating this tradeoff. Also covers regularizing priors as a Bayesian alternative.

The Problem with Parameters

  • In-sample $R^2$ never decreases when parameters are added; even random predictors improve in-sample fit
  • More complex models overfit — they learn noise and predict new data worse
  • Simpler models underfit — they miss real patterns

The brain-size example (7 hominin species): a 6th-degree polynomial has as many parameters as there are data points, so it achieves $R^2 = 1$, yet it predicts negative brain volumes between data points.
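The same effect can be reproduced on synthetic data. The following is a minimal sketch, not the book's code: the data are made-up draws from a linear trend plus noise (not the hominin measurements), and the names `r_squared`, `r2_in`, and `r2_out` are illustrative. It fits polynomials of increasing degree and compares in-sample $R^2$ with $R^2$ on a fresh sample from the same process.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

# Synthetic stand-in for a small data set: 7 points, linear trend plus noise.
# (Hypothetical values, not the hominin data from the book.)
n = 7
x = np.linspace(-1.0, 1.0, n)
y = 0.5 * x + rng.normal(scale=0.3, size=n)

# A fresh sample from the same process plays the role of "new data".
y_new = 0.5 * x + rng.normal(scale=0.3, size=n)


def r_squared(y_obs, y_hat):
    """1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_obs - y_hat) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot


r2_in, r2_out = [], []
for degree in range(1, n):          # degrees 1 .. n-1
    coefs = P.polyfit(x, y, degree)  # least-squares polynomial fit
    y_hat = P.polyval(x, coefs)
    r2_in.append(r_squared(y, y_hat))
    r2_out.append(r_squared(y_new, y_hat))
    print(f"degree {degree}: in-sample R^2 = {r2_in[-1]:.3f}, "
          f"new-sample R^2 = {r2_out[-1]:.3f}")
```

In-sample $R^2$ climbs to 1 once the polynomial has as many coefficients as there are points, because the curve then passes through every observation. The new-sample column is the quantity that actually matters for prediction.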

Two Families of Solutions

1. Regularizing Priors

Informative priors that are skeptical of extreme parameter values:

  • Prevent overfitting by keeping parameters modest
  • The Bayesian version of “penalized likelihood” / ridge / lasso
  • See Bayesian Linear Regression for specific prior choices (horseshoe, etc.)
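The link between skeptical priors and ridge regression can be sketched for a linear model with known noise scale. This is an illustration under stated assumptions, not the book's code: the data are simulated, there is no intercept, and `map_estimate` is a hypothetical helper. With coefficients given independent Normal(0, `prior_sd`) priors, the posterior mode equals the ridge solution with penalty `sigma**2 / prior_sd**2`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (hypothetical): 20 observations, 10 predictors,
# only the first predictor has a real effect.
n, k = 20, 10
X = rng.normal(size=(n, k))
beta_true = np.zeros(k)
beta_true[0] = 1.0
sigma = 1.0
y = X @ beta_true + rng.normal(scale=sigma, size=n)


def map_estimate(X, y, sigma, prior_sd):
    """Posterior mode of beta for y ~ Normal(X @ beta, sigma) with
    independent beta_j ~ Normal(0, prior_sd) priors.

    Algebraically identical to ridge regression with
    penalty lam = sigma**2 / prior_sd**2.
    """
    lam = sigma**2 / prior_sd**2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


beta_flat = np.linalg.lstsq(X, y, rcond=None)[0]      # flat prior: ordinary least squares
beta_weak = map_estimate(X, y, sigma, prior_sd=10.0)  # nearly flat prior
beta_skeptical = map_estimate(X, y, sigma, prior_sd=0.5)

for name, b in [("flat", beta_flat), ("weak", beta_weak), ("skeptical", beta_skeptical)]:
    print(f"{name:>9}: coefficient norm = {np.linalg.norm(b):.3f}")
```

The tighter the prior, the more the coefficients are pulled toward zero, which is what keeps the nine irrelevant predictors from soaking up noise.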

2. Information Criteria

Score models on out-of-sample predictive accuracy estimated from in-sample fit.

Information Theory Foundations

Kullback-Leibler divergence measures the distance from a model $q$ to the true distribution $p$:

$$D_{KL}(p, q) = \sum_i p_i \left( \log p_i - \log q_i \right)$$

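We can’t compute $D_{KL}$ directly (we don’t know $p$), but we can estimate differences in $D_{KL}$ between models using deviance:

$$D(q) = -2 \sum_i \log q_i$$

Here $q_i$ is the likelihood the model assigns to observation $i$. The unknown $\sum_i p_i \log p_i$ term is the same for every model, so it cancels when two models are compared.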
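A small numerical check may help make this concrete. The sketch below uses a made-up two-outcome "true" distribution `p` and two made-up candidate models `q1` and `q2`; the helper names `kl_divergence` and `deviance` are illustrative. It shows that the difference in divergence between two models depends only on their expected log-probabilities, and that deviance computed on a sample ranks the models the same way without access to `p`.

```python
import numpy as np


def kl_divergence(p, q):
    """D_KL(p, q) = sum_i p_i * (log p_i - log q_i) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))


def deviance(q, sample):
    """-2 * total log-probability that model q assigns to the observed outcomes."""
    q = np.asarray(q, dtype=float)
    return float(-2.0 * np.sum(np.log(q[sample])))


# Hypothetical truth and two candidate models.
p = np.array([0.3, 0.7])
q1 = np.array([0.25, 0.75])   # close to the truth
q2 = np.array([0.5, 0.5])     # further from the truth

kl1 = kl_divergence(p, q1)
kl2 = kl_divergence(p, q2)

# The sum_i p_i log p_i term cancels in the difference, leaving only E_p[log q].
elog_q1 = float(np.sum(p * np.log(q1)))
elog_q2 = float(np.sum(p * np.log(q2)))
print("KL difference:          ", kl1 - kl2)
print("E[log q2] - E[log q1]:  ", elog_q2 - elog_q1)

# In practice p is unknown; a sample from it stands in for the expectation.
rng = np.random.default_rng(2)
sample = rng.choice(2, size=1000, p=p)
dev1 = deviance(q1, sample)
dev2 = deviance(q2, sample)
print("deviance q1:", dev1, " deviance q2:", dev2)
```

Note that the divergence is not symmetric: `kl_divergence(p, q2)` and `kl_divergence(q2, p)` give different values, so "distance" is meant loosely.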

The Criteria Zoo

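| Criterion | Formula | Assumptions |
|---|---|---|
| AIC | $D_{\text{train}} + 2k$ | Flat priors, Gaussian posterior, $N \gg k$ |
| DIC | $\bar{D} + p_D$, where $p_D = \bar{D} - \hat{D}$ | Point-estimate posterior |
| WAIC | $-2(\text{lppd} - p_{\text{WAIC}})$ | Fully Bayesian, pointwise |

Here $k$ is the number of free parameters, $N$ the sample size, $\bar{D}$ the posterior mean deviance, $\hat{D}$ the deviance at the posterior mean, lppd the log pointwise predictive density, and $p_{\text{WAIC}}$ the effective number of parameters.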

WAIC is the most general: it uses the full posterior, makes no Gaussian approximation, and computes the effective number of parameters pointwise.
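The pointwise computation can be sketched from a matrix of log-likelihood values, one row per posterior sample and one column per observation. This is a minimal illustration, assuming such a matrix is already available from whatever sampler was used; the function name `waic` and the tiny hand-written matrix are hypothetical.

```python
import numpy as np


def waic(loglik):
    """WAIC from an (S, N) array with loglik[s, i] = log p(y_i | theta_s).

    lppd_i   = log of the posterior-mean likelihood of observation i
    p_waic_i = posterior variance of the log-likelihood of observation i
    WAIC     = -2 * (sum_i lppd_i - sum_i p_waic_i)
    """
    loglik = np.asarray(loglik, dtype=float)
    # Numerically stable log(mean(exp(.))) over samples, per observation.
    m = loglik.max(axis=0)
    lppd_i = m + np.log(np.mean(np.exp(loglik - m), axis=0))
    # Sample variance across posterior draws (requires S >= 2).
    p_waic_i = loglik.var(axis=0, ddof=1)
    pointwise = -2.0 * (lppd_i - p_waic_i)
    return {
        "waic": float(pointwise.sum()),
        "lppd": float(lppd_i.sum()),
        "p_waic": float(p_waic_i.sum()),
        "pointwise": pointwise,
    }


# Tiny made-up example: 3 posterior samples, 3 observations.
loglik = np.array([
    [-1.0, -2.0, -0.5],
    [-1.2, -1.8, -0.7],
    [-0.9, -2.1, -0.4],
])
result = waic(loglik)
print(result["waic"], result["lppd"], result["p_waic"])
```

Because both terms are sums over observations, each observation's contribution to the penalty can be inspected separately via `result["pointwise"]`.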

WAIC is Preferred

Among information criteria, WAIC is the most Bayesian and makes the fewest assumptions. It converges to AIC when priors are flat and the posterior is Gaussian.
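The convergence claim can be checked numerically in a case where the posterior is known exactly. The sketch below assumes a deliberately simple model chosen for this purpose: $y_i \sim \text{Normal}(\mu, \sigma)$ with $\sigma$ known and a flat prior on $\mu$, so the posterior for $\mu$ is Normal($\bar{y}$, $\sigma/\sqrt{N}$) and can be sampled directly. The data are simulated, all variable names are illustrative, and DIC is computed as posterior mean deviance plus $p_D$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data from the assumed model (sigma treated as known).
sigma = 1.0
n = 200
y = rng.normal(loc=0.3, scale=sigma, size=n)

# Flat prior on mu => posterior is Normal(mean(y), sigma / sqrt(n)).
S = 4000
mu_samples = rng.normal(loc=y.mean(), scale=sigma / np.sqrt(n), size=S)


def log_lik(mu):
    """Pointwise Gaussian log-likelihood of y; mu is a scalar or an (S, 1) array."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)


ll = log_lik(mu_samples[:, None])          # shape (S, n)

# AIC: deviance at the maximum-likelihood estimate plus 2 * (number of parameters).
k = 1
aic = -2.0 * log_lik(y.mean()).sum() + 2 * k

# DIC: posterior mean deviance plus p_D = D_bar - D_hat.
D_bar = np.mean(-2.0 * ll.sum(axis=1))
D_hat = -2.0 * log_lik(mu_samples.mean()).sum()
p_D = D_bar - D_hat
dic = D_bar + p_D

# WAIC: pointwise lppd and pointwise variance penalty.
m = ll.max(axis=0)
lppd = np.sum(m + np.log(np.mean(np.exp(ll - m), axis=0)))
p_waic = np.sum(ll.var(axis=0, ddof=1))
waic = -2.0 * (lppd - p_waic)

print(f"AIC  = {aic:.2f}  (k = {k})")
print(f"DIC  = {dic:.2f}  (p_D = {p_D:.2f})")
print(f"WAIC = {waic:.2f}  (p_WAIC = {p_waic:.2f})")
```

Under these conditions all three criteria land close together and both effective-parameter estimates sit near 1. The agreement is expected to weaken once priors become informative or the posterior departs from Gaussian, which is where WAIC's weaker assumptions matter.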

See Also