Overfitting and Information Criteria

Summary

Chapter 6 of Statistical Rethinking covers the bias-variance tradeoff using the metaphors of Scylla (overfitting) and Charybdis (underfitting). It introduces information theory, KL divergence, and information criteria (AIC, DIC, WAIC) as tools for navigating this tradeoff, and it covers regularizing priors as a Bayesian alternative.

The Problem with Parameters

$R^{2}$ never decreases as parameters are added: even random predictors improve in-sample fit
More complex models overfit: they learn noise and predict new data worse
Simpler models underfit — they miss real patterns
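
The first point can be checked numerically. The sketch below is an illustration on simulated data, not the book's code: it fits nested least-squares models, adding one real predictor followed by five pure-noise predictors, and records in-sample $R^{2}$ after each addition. The helper names `dot` and `r2_path`, the sample size, and the seed are assumptions made for the example.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def r2_path(y, predictors):
    """In-sample R^2 of nested OLS fits (with intercept), adding predictors one at a time.

    Uses Gram-Schmidt: each new predictor is centred and orthogonalised against
    the ones already in the model, then its projection is removed from the residual.
    """
    n = len(y)
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]  # residual of the intercept-only model
    ss_total = dot(resid, resid)
    basis, path = [], []
    for x in predictors:
        mean_x = sum(x) / n
        u = [v - mean_x for v in x]
        for b in basis:
            c = dot(u, b) / dot(b, b)
            u = [ui - c * bi for ui, bi in zip(u, b)]
        if dot(u, u) > 1e-12:  # skip predictors that add no new direction
            c = dot(resid, u) / dot(u, u)
            resid = [ri - c * ui for ri, ui in zip(resid, u)]
            basis.append(u)
        path.append(1.0 - dot(resid, resid) / ss_total)
    return path


random.seed(1)
n = 20
x_real = [random.gauss(0, 1) for _ in range(n)]            # the one real predictor
y = [2.0 * x + random.gauss(0, 1) for x in x_real]          # simulated outcome
noise = [[random.gauss(0, 1) for _ in range(n)] for _ in range(5)]  # unrelated to y

r2 = r2_path(y, [x_real] + noise)
for k, value in enumerate(r2, start=1):
    print(f"{k} predictor(s): R^2 = {value:.3f}")
```

Each noise column is unrelated to `y` by construction, yet the printed $R^{2}$ cannot go down, because the residual sum of squares can only shrink when a new direction is projected out.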

The brain-size example (7 hominin species): a 5th-degree polynomial achieves $R^{2} = 0.99$, and a 6th-degree polynomial passes through every point exactly ($R^{2} = 1$) but predicts negative brain volumes between data points.

Two Families of Solutions

1. Regularizing Priors

Informative priors that are skeptical of extreme parameter values:

Prevent overfitting by keeping parameters modest
The Bayesian version of “penalized likelihood” / ridge / lasso
See Bayesian Linear Regression for specific prior choices (horseshoe, etc.)
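
The link between a skeptical prior and ridge regression can be made concrete for a single slope. This is a minimal sketch, assuming a no-intercept model $y_{i} \sim \text{Normal}(\beta x_{i}, \sigma)$ with known $\sigma$ and a prior $\beta \sim \text{Normal}(0, \tau)$. Under those assumptions the posterior mode is the ridge estimate with penalty $\lambda = \sigma^{2} / \tau^{2}$. The function `map_slope` and the simulated data are illustrative and do not come from the book.

```python
import random


def map_slope(x, y, sigma, tau):
    """Posterior mode of beta for y ~ Normal(beta * x, sigma), beta ~ Normal(0, tau)."""
    penalty = sigma ** 2 / tau ** 2  # ridge penalty implied by the prior
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + penalty)


random.seed(2)
x = [random.gauss(0, 1) for _ in range(10)]
y = [0.5 * a + random.gauss(0, 1) for a in x]

# Unpenalised least-squares slope for comparison.
ols = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Wide prior (tau = 100) down to a tight, skeptical prior (tau = 0.2).
estimates = {tau: map_slope(x, y, sigma=1.0, tau=tau) for tau in (100.0, 1.0, 0.2)}
print(f"least squares: {ols:.3f}")
for tau, beta in estimates.items():
    print(f"prior sd {tau:>5}: MAP slope = {beta:.3f}")
```

A very wide prior leaves the estimate essentially at the least-squares value, while tighter priors pull it toward zero.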

2. Information Criteria

Score models on out-of-sample predictive accuracy estimated from in-sample fit.

Information Theory Foundations

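Kullback-Leibler divergence measures the additional uncertainty incurred when a model $q$ is used to describe the true distribution $p$. It is often described as a distance from $q$ to $p$, but it is not symmetric: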

$$D_{KL}(p, q) = \sum_{i} p_{i} \log \frac{p_{i}}{q_{i}}$$
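
For discrete distributions the divergence is a one-line sum. The sketch below uses made-up probabilities chosen only for illustration; `kl_divergence` is a hypothetical helper name.

```python
import math


def kl_divergence(p, q):
    """D_KL(p, q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.3, 0.7]          # "true" distribution (made-up values)
q_close = [0.25, 0.75]  # a model near p
q_far = [0.9, 0.1]      # a model far from p

print(kl_divergence(p, q_close))
print(kl_divergence(p, q_far))
print(kl_divergence(q_far, p))  # differs from the line above: not symmetric
```

The divergence is zero only when the model matches $p$ exactly, and swapping the arguments changes the value.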

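We can't compute $D_{KL}$ directly because the true $p$ is unknown. When two models are compared, however, the term involving only $p$ cancels, so differences in $D_{KL}$ can be estimated from each model's log score, usually reported as deviance: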

$$D = -2 \sum_{i} \log q(y_{i})$$
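
As a rough illustration, deviance can be computed for two candidate Gaussian models on the same simulated sample. The data, the seed, and the helper names `normal_logpdf` and `deviance` are assumptions for this sketch.

```python
import math
import random


def normal_logpdf(y, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)


def deviance(ys, mu, sigma):
    """D = -2 * sum_i log q(y_i) for a Normal(mu, sigma) model q."""
    return -2.0 * sum(normal_logpdf(y, mu, sigma) for y in ys)


random.seed(3)
ys = [random.gauss(1.0, 1.0) for _ in range(100)]  # simulated from Normal(1, 1)

d_good = deviance(ys, mu=1.0, sigma=1.0)  # model matching the simulator
d_poor = deviance(ys, mu=0.0, sigma=1.0)  # model with the wrong mean
print(f"deviance, mu=1: {d_good:.1f}")
print(f"deviance, mu=0: {d_poor:.1f}")
```

Lower deviance means a higher log score. Both numbers here are computed on the data in hand, which is why the criteria that follow add a penalty before comparing models.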

The Criteria Zoo

Criterion	Formula	Assumptions
AIC	$D_{train} + 2k$	Flat priors, Gaussian posterior, $n \gg k$
DIC	$\bar{D} + p_{D}$, where $p_{D} = \bar{D} - D(\bar{\theta})$	Approximately Gaussian posterior (relies on the point estimate $\bar{\theta}$), $n \gg k$
WAIC	$-2(\mathrm{lppd} - p_{WAIC})$	Fully Bayesian, pointwise; no Gaussian posterior required

WAIC is the most general: it uses the full posterior, makes no Gaussian approximation, and computes the effective number of parameters pointwise.
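
The pointwise calculation can be sketched from a matrix of log-likelihoods, one row per posterior draw and one column per observation: lppd sums the log of each observation's average likelihood, and $p_{WAIC}$ sums each observation's variance in log-likelihood. The example assumes a Gaussian model with known $\sigma$ and a flat prior on the mean, so the posterior can be sampled directly; the data, seed, and function names are illustrative, not the book's code.

```python
import math
import random
import statistics


def normal_logpdf(y, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(loglik):
    """WAIC from loglik[s][i] = log p(y_i | theta_s) over posterior draws s."""
    n_obs = len(loglik[0])
    columns = [[row[i] for row in loglik] for i in range(n_obs)]
    lppd = sum(log_mean_exp(col) for col in columns)
    p_waic = sum(statistics.variance(col) for col in columns)
    return -2.0 * (lppd - p_waic), lppd, p_waic


random.seed(4)
sigma = 1.0  # treated as known (an assumption of this sketch)
ys = [random.gauss(0.0, sigma) for _ in range(50)]
y_bar = sum(ys) / len(ys)

# Flat prior on mu with known sigma: posterior is Normal(y_bar, sigma / sqrt(n)).
post_sd = sigma / math.sqrt(len(ys))
mu_draws = [random.gauss(y_bar, post_sd) for _ in range(2000)]
loglik = [[normal_logpdf(y, mu, sigma) for y in ys] for mu in mu_draws]

waic_value, lppd, p_waic = waic(loglik)
aic_value = -2.0 * sum(normal_logpdf(y, y_bar, sigma) for y in ys) + 2 * 1  # k = 1
print(f"WAIC = {waic_value:.1f} (p_WAIC = {p_waic:.2f}), AIC = {aic_value:.1f}")
```

With a flat prior and a Gaussian posterior, as here, the two criteria should land close together and $p_{WAIC}$ should be near the single free parameter.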

WAIC is Preferred

Among these information criteria, WAIC makes the fewest assumptions and works directly from the posterior distribution. Its value converges to AIC's when priors are flat and the posterior is Gaussian.
