Hierarchical Models

Summary

Chapter 5 of BDA3 introduces hierarchical (multilevel) models, the workhorse of applied Bayesian statistics. Parameters are modeled as exchangeable draws from a common population distribution, which enables partial pooling between groups. Partial pooling trades off bias against variance: it shrinks noisy group estimates toward the population (grand) mean while leaving well-estimated group effects largely unchanged.

The Core Idea: Exchangeability

Parameters are exchangeable if their joint distribution is invariant to permutations of the indices: we have no prior reason to treat any group differently. By de Finetti's theorem, exchangeable parameters can be written as conditionally i.i.d. given hyperparameters $\phi$:

$$p(\theta_1, \dots, \theta_J) = \int \prod_{j=1}^{J} p(\theta_j \mid \phi) \, p(\phi) \, d\phi$$

This is the probabilistic justification for the hierarchical model structure: exchangeability implies a prior, not the other way around.

Definition: Exchangeability

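A sequence $\theta_1, \dots, \theta_J$ is exchangeable if, for any permutation $\pi$ of $\{1, \dots, J\}$:

$$p(\theta_1, \dots, \theta_J) = p(\theta_{\pi(1)}, \dots, \theta_{\pi(J)})$$

Infinite exchangeability implies a hierarchical model with hyperparameter $\phi$ (de Finetti's theorem). For a finite number of groups the mixture representation holds only approximately, but it remains the standard modeling choice.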
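
A minimal simulation sketch of the mixture representation, assuming a toy normal model that is not taken from the chapter ($\mu \sim N(0, 1)$ and $\theta_j \mid \mu \sim N(\mu, 1)$). It illustrates that exchangeable parameters are conditionally independent given the hyperparameter but marginally dependent, with the same dependence for every pair of groups.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, J = 200_000, 3  # illustrative sizes

# Hyperparameter drawn once per simulated "world" (toy choice: mu ~ N(0, 1)).
mu = rng.normal(0.0, 1.0, size=n_sims)

# Given mu, the J group parameters are i.i.d. N(mu, 1).
theta = rng.normal(mu[:, None], 1.0, size=(n_sims, J))

# With mu integrated out, the theta_j are correlated:
# cov = Var(mu) = 1 and var = Var(mu) + 1 = 2, so corr = 0.5 for every pair.
corr = np.corrcoef(theta, rowvar=False)
print(np.round(corr, 2))
```

Because the joint distribution is symmetric in the group indices, all off-diagonal correlations agree up to Monte Carlo error.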

The Eight Schools Example

The canonical example (Rubin 1981): estimating coaching effects from 8 independent educational experiments, each with its own estimated effect $y_j$ and known standard error $\sigma_j$.

Estimator | Approach | Result
No pooling | $\hat\theta_j = y_j$ | High variance; large SEs
Complete pooling | $\hat\theta_j = \bar{y}$ (precision-weighted mean) | Biased if effects truly differ
Partial pooling | Hierarchical posterior | Balances bias and variance
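
The two extreme estimators can be sketched in a few lines. The data below are the standard eight schools estimates and standard errors as tabulated in BDA3 (they are not reproduced in these notes, so treat the arrays as an assumption of this sketch).

```python
import numpy as np

# Eight schools: estimated coaching effects y_j and their standard errors sigma_j.
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

# No pooling: each school keeps its own estimate and its own (large) SE.
no_pool_est, no_pool_se = y, sigma

# Complete pooling: a single common effect, estimated by the
# precision-weighted mean of the eight estimates.
w = 1.0 / sigma**2
pooled_est = np.sum(w * y) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

print(f"complete pooling: {pooled_est:.1f} (SE {pooled_se:.1f})")
# prints "complete pooling: 7.7 (SE 4.1)"
```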

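The Bayes estimate under the normal hierarchical model, conditional on $\mu$ and $\tau$:

$$\hat\theta_j = E[\theta_j \mid y, \mu, \tau] = \frac{\frac{1}{\sigma_j^2} y_j + \frac{1}{\tau^2} \mu}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}$$

This is a precision-weighted average of the group observation $y_j$ and the grand mean $\mu$. The weight on the group observation increases as $\sigma_j$ decreases (more data) or $\tau$ increases (more between-group variation).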

Key insight: the posterior for $\tau$ is informed by how much the groups actually vary. If all $y_j$ are similar, $\tau \approx 0$ and estimates collapse toward complete pooling. If they vary widely, $\tau$ is large and estimates approach no pooling.
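
The two limits can be checked numerically. The sketch below computes the conditional posterior mean of $\theta_j$ as a precision-weighted average of $y_j$ (precision $1/\sigma_j^2$) and $\mu$ (precision $1/\tau^2$). The function name `shrinkage_estimate` and the fixed value `mu = 8.0` are illustrative assumptions; the data are the standard eight schools values from BDA3.

```python
import numpy as np

def shrinkage_estimate(y, sigma, mu, tau):
    """Conditional posterior mean of theta_j given (mu, tau) in the normal model."""
    prec_data = 1.0 / sigma**2   # precision of each group's observation
    prec_pop = 1.0 / tau**2      # precision of the population distribution
    weight = prec_data / (prec_data + prec_pop)  # weight on y_j
    return weight * y + (1.0 - weight) * mu

y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])
mu = 8.0  # illustrative value for the grand mean, held fixed here

# Small tau pulls every estimate to mu; large tau returns the raw y_j.
for tau in (0.1, 5.0, 100.0):
    print(tau, np.round(shrinkage_estimate(y, sigma, mu, tau), 1))
```

In a full analysis $\mu$ and $\tau$ are not fixed; the estimates are averaged over their posterior.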

Structure of a Hierarchical Model

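Definition: Three-Level Hierarchical Model

$$y_j \mid \theta_j \sim p(y_j \mid \theta_j) \quad \text{(data level)}$$
$$\theta_j \mid \phi \sim p(\theta_j \mid \phi) \quad \text{(group level)}$$
$$\phi \sim p(\phi) \quad \text{(hyperprior)}$$

where $\phi$ are the hyperparameters governing the group-level distribution (in the normal model, $\phi = (\mu, \tau)$).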

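The posterior factorizes as:

$$p(\theta, \phi \mid y) \propto p(\phi) \prod_{j=1}^{J} p(\theta_j \mid \phi) \, p(y_j \mid \theta_j)$$

Inference proceeds by first marginalizing over $\theta$ to obtain $p(\phi \mid y)$, then drawing $\theta$ from $p(\theta \mid \phi, y)$ given posterior draws of $\phi$. Outside conjugate cases such as the normal model, this generally requires MCMC (see MCMC Basics).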
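
For the normal model with known $\sigma_j$ the two-step scheme can be carried out without MCMC. The sketch below follows the grid approach described in BDA3 for this model: evaluate the marginal posterior of $\tau$ on a grid, draw $\tau$, then $\mu \mid \tau, y$, then $\theta \mid \mu, \tau, y$. The flat prior on $(\mu, \tau)$, the grid range (0.01 to 40), and the number of draws are assumptions of this sketch; the data are the standard eight schools values.

```python
import numpy as np

y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

def log_marginal_tau(tau):
    """Unnormalized log p(tau | y) with theta and mu integrated out (flat priors)."""
    v = sigma**2 + tau**2                 # marginal variance of each y_j
    v_mu = 1.0 / np.sum(1.0 / v)          # Var(mu | tau, y)
    mu_hat = v_mu * np.sum(y / v)         # E(mu | tau, y)
    return (0.5 * np.log(v_mu) - 0.5 * np.sum(np.log(v))
            - 0.5 * np.sum((y - mu_hat) ** 2 / v))

rng = np.random.default_rng(0)
tau_grid = np.linspace(0.01, 40.0, 2000)  # assumed grid; truncates the upper tail
log_p = np.array([log_marginal_tau(t) for t in tau_grid])
p = np.exp(log_p - log_p.max())
p /= p.sum()

n_draws = 4000
# Step 1: draw tau from its (discretized) marginal posterior.
tau = rng.choice(tau_grid, size=n_draws, p=p)

# Step 2: draw mu | tau, y from its normal conditional posterior.
v = sigma**2 + tau[:, None] ** 2
v_mu = 1.0 / np.sum(1.0 / v, axis=1)
mu_hat = v_mu * np.sum(y / v, axis=1)
mu = rng.normal(mu_hat, np.sqrt(v_mu))

# Step 3: draw theta_j | mu, tau, y (precision-weighted mean, combined precision).
prec = 1.0 / sigma**2 + 1.0 / tau[:, None] ** 2
theta_mean = (y / sigma**2 + mu[:, None] / tau[:, None] ** 2) / prec
theta = rng.normal(theta_mean, np.sqrt(1.0 / prec))

print(np.round(theta.mean(axis=0), 1))  # posterior means, shrunk toward the grand mean
```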

Hyperprior Choice

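The hyperprior on $\tau$ (the between-group SD) is critical:

  • Flat prior $p(\tau) \propto 1$: can cause improper posteriors when $J$ is small (the posterior is proper only for $J \ge 3$)
  • Half-Cauchy($0, A$): recommended weakly informative prior (Gelman 2006); set $A$ to the expected scale of between-group variation
  • Half-$t_\nu$: a family that includes the half-Cauchy as the $\nu = 1$ case; for $\nu > 1$ the tails are lighter than half-Cauchy but heavier than half-normal, which still leaves room for large $\tau$ when outlier groups are plausible
  • Inverse-Gamma on $\tau^2$: historically popular, but inferences are sensitive to its parameters when $\tau$ is near zero and it can underestimate $\tau$; avoid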
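
A rough way to see how much the hyperprior matters is to reweight the same marginal likelihood for $\tau$ under two priors. The sketch below does this on a grid for the eight schools data (standard values from BDA3), comparing a flat prior with a half-Cauchy prior. The scale `A = 5.0` and the grid range are hypothetical choices made only for illustration.

```python
import numpy as np

y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

def log_marginal_lik(tau):
    """log p(y | tau) up to a constant, with mu integrated out under a flat prior."""
    v = sigma**2 + tau**2
    v_mu = 1.0 / np.sum(1.0 / v)
    mu_hat = v_mu * np.sum(y / v)
    return (0.5 * np.log(v_mu) - 0.5 * np.sum(np.log(v))
            - 0.5 * np.sum((y - mu_hat) ** 2 / v))

tau_grid = np.linspace(0.01, 50.0, 5000)  # assumed grid; the flat prior is truncated at 50
log_lik = np.array([log_marginal_lik(t) for t in tau_grid])

def posterior_mean_tau(log_prior):
    """Posterior mean of tau on the grid for a given unnormalized log prior."""
    log_post = log_lik + log_prior
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    return np.sum(w * tau_grid)

A = 5.0  # hypothetical half-Cauchy scale
log_flat = np.zeros_like(tau_grid)
log_half_cauchy = -np.log1p((tau_grid / A) ** 2)  # half-Cauchy(0, A) density, up to a constant

mean_flat = posterior_mean_tau(log_flat)
mean_half_cauchy = posterior_mean_tau(log_half_cauchy)
print(f"E[tau | y]: flat = {mean_flat:.2f}, half-Cauchy(0, {A}) = {mean_half_cauchy:.2f}")
```

With only eight groups the data constrain $\tau$ weakly, so the two posterior means can differ noticeably; the half-Cauchy prior, which decreases in $\tau$, always gives the smaller one.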

Boundary Avoidance

When $J$ is small (e.g. only a handful of groups), the posterior for $\tau$ can concentrate near 0, collapsing toward complete pooling. Use half-Cauchy or half-$t$ hyperpriors with a scale parameter informed by domain knowledge to mitigate this.

Partial Pooling as Regularization

Partial pooling is equivalent to regularization of the group-level estimates. The hierarchical prior on $\theta_j$ acts as a penalty on how far group means deviate from $\mu$:

  • Analogous to ridge regression (L2 penalty) applied to group effects
  • The penalty strength is determined from the data via the posterior for $\tau$, unlike fixed ridge penalties
  • In high dimensions, this adaptive regularization is crucial — see Bayesian Linear Regression for the regression analog

This connection makes hierarchical models the Bayesian answer to many problems framed as “multiple comparisons” or “multiple testing” in the frequentist literature. See Partial Pooling as Multiple Comparisons Correction for the formal algebra.
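
The ridge analogy above can be checked numerically in the simplest case. The sketch below assumes a common standard error $\sigma$ for all groups and fixed $\mu$ and $\tau$ (all values are hypothetical), and shows that a ridge fit with penalty $\lambda = \sigma^2 / \tau^2$, centered at $\mu$, reproduces the precision-weighted Bayes estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
J = 6
sigma, tau, mu = 2.0, 1.5, 3.0  # hypothetical common SE, between-group SD, grand mean
y = rng.normal(mu, np.sqrt(sigma**2 + tau**2), size=J)

# Ridge view: minimize ||y - theta||^2 + lam * ||theta - mu||^2.
lam = sigma**2 / tau**2
X = np.eye(J)  # one indicator column per group
delta = np.linalg.solve(X.T @ X + lam * np.eye(J), X.T @ (y - mu))
theta_ridge = mu + delta

# Hierarchical view: precision-weighted average of y_j and mu.
theta_bayes = (y / sigma**2 + mu / tau**2) / (1.0 / sigma**2 + 1.0 / tau**2)

print(np.allclose(theta_ridge, theta_bayes))  # prints True
```

The difference from ordinary ridge regression is that $\lambda$ is not tuned separately: it is implied by $\tau$, which the hierarchical model estimates.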

Key Practical Concepts

  • Weakly informative hyperpriors: half-Cauchy or half-$t$ on $\tau$ avoid boundary issues
  • Meta-analysis: hierarchical models are the natural framework for combining estimates across studies — the group-level model is the meta-analytic model
  • Non-centered parameterization: for sampling efficiency, reparameterize $\theta_j = \mu + \tau \eta_j$, $\eta_j \sim N(0, 1)$; this avoids funnel geometry in the posterior
  • Varying intercepts and slopes: the regression extension (Hierarchical Linear Models) allows the intercept $\alpha_j$ and slope $\beta_j$ to vary by group, with partial pooling on each
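
The non-centered parameterization listed above can be sketched as a change of variables. In the centered form the prior density of $\theta$ depends on $\tau$, which is what produces the funnel as $\tau \to 0$; in the non-centered form the sampled variable $\eta$ has a fixed standard normal prior and $\theta$ is computed deterministically. The function names and numeric values are illustrative assumptions.

```python
import numpy as np

def log_density_centered(theta, mu, tau):
    """log p(theta | mu, tau) for theta_j ~ N(mu, tau^2), dropping the 2*pi constant."""
    return -theta.size * np.log(tau) - 0.5 * np.sum(((theta - mu) / tau) ** 2)

def log_density_noncentered(eta):
    """log p(eta) for eta_j ~ N(0, 1), dropping the 2*pi constant; free of tau."""
    return -0.5 * np.sum(eta**2)

rng = np.random.default_rng(1)
mu, tau, J = 4.0, 0.3, 8   # hypothetical hyperparameter values
eta = rng.normal(size=J)   # what a sampler would explore
theta = mu + tau * eta     # deterministic transform back to group effects

# The two log densities differ only by the Jacobian term J * log(tau).
gap = log_density_noncentered(eta) - log_density_centered(theta, mu, tau)
print(np.isclose(gap, J * np.log(tau)))  # prints True
```

A sampler working on $(\mu, \tau, \eta)$ sees a prior for $\eta$ whose scale does not change with $\tau$, which is why this form tends to help when the data are weakly informative about each group.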

Connections to Causal Inference

Hierarchical models appear naturally in causal inference.
