Hierarchical Models
Summary
Chapter 5 of BDA3 introduces hierarchical (multilevel) models, the workhorse of applied Bayesian statistics. Parameters are modeled as exchangeable draws from a common population distribution, enabling partial pooling between groups. Partial pooling trades bias against variance: it shrinks noisy group estimates toward the population (grand) mean while leaving well-estimated group effects largely unchanged.
The Core Idea: Exchangeability
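Parameters $\theta_1, \dots, \theta_J$ are exchangeable if their joint distribution is invariant to permutations of the indices, that is, we have no prior reason to treat any group differently. By de Finetti's theorem, (infinitely) exchangeable parameters can be written as conditionally i.i.d. given hyperparameters $\phi$:

$$p(\theta_1, \dots, \theta_J) = \int \left[ \prod_{j=1}^{J} p(\theta_j \mid \phi) \right] p(\phi)\, d\phi$$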
This is the probabilistic justification for the hierarchical model structure: exchangeability implies a prior, not the other way around.
Definition: Exchangeability
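A sequence $\theta_1, \dots, \theta_J$ is exchangeable if for any permutation $\pi$ of $\{1, \dots, J\}$:

$$p(\theta_1, \dots, \theta_J) = p(\theta_{\pi(1)}, \dots, \theta_{\pi(J)})$$

Infinite exchangeability implies a hierarchical model with hyperparameter $\phi$. De Finetti's theorem does not hold exactly for finite $J$, where the mixture form is an approximation.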
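A small simulation can make the "conditionally i.i.d. given hyperparameters" idea concrete. The sketch below assumes a hypothetical two-stage model ($\phi \sim \mathcal{N}(0, 1)$, then $\theta_j \mid \phi \sim \mathcal{N}(\phi, 1)$) chosen purely for illustration; it is not taken from BDA3. Under that assumption the $\theta_j$ are exchangeable but not independent: every pair shares the same marginal correlation, with theoretical value $0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, J = 200_000, 3

# Hypothetical two-stage model (illustrative assumption):
# phi ~ N(0, 1), then theta_j | phi ~ N(phi, 1) independently for each j.
phi = rng.standard_normal(n_sims)
theta = phi[:, None] + rng.standard_normal((n_sims, J))

# Marginally, every pair (theta_j, theta_k) has the same correlation,
# so relabeling the groups leaves the joint distribution unchanged.
corr = np.corrcoef(theta, rowvar=False)
print(np.round(corr, 2))

# Conditional on phi, the residuals theta_j - phi are uncorrelated.
resid_corr = np.corrcoef(theta - phi[:, None], rowvar=False)
print(np.round(resid_corr, 2))
```

The shared off-diagonal correlation is the visible footprint of the common hyperparameter: the groups inform each other only through $\phi$.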
The Eight Schools Example
The canonical example (Rubin 1981): estimating coaching effects from 8 independent educational experiments, each with its own estimated effect $y_j$ and known standard error $\sigma_j$.
| Estimator | Approach | Result |
|---|---|---|
| No pooling | $\hat\theta_j = y_j$ | High variance; large SEs |
| Complete pooling | $\hat\theta_j = \bar{y}$ (precision-weighted mean of all $y_j$) | Biased if effects truly differ |
| Partial pooling | Hierarchical posterior | Data-adaptive bias-variance tradeoff |
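The Bayes estimate under the normal hierarchical model, conditional on $\mu$ and $\tau$:

$$\hat\theta_j = \frac{\dfrac{1}{\sigma_j^2}\, y_j + \dfrac{1}{\tau^2}\, \mu}{\dfrac{1}{\sigma_j^2} + \dfrac{1}{\tau^2}}$$

This is a precision-weighted average of the group observation $y_j$ and the grand mean $\mu$. The weight on the group observation increases as $\sigma_j$ decreases (more data) or $\tau$ increases (more between-group variation).
Key insight: the posterior for $\tau$ is informed by how much the groups actually vary. If all $y_j$ are similar, the posterior for $\tau$ concentrates near 0 and estimates collapse to complete pooling. If they vary widely, $\tau$ is large and estimates approach no-pooling.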
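The precision-weighted average can be sketched numerically. The example below uses the eight schools estimates and standard errors as tabulated in BDA3 (Table 5.2). The values of $\tau$ are arbitrary illustrative choices, and plugging in the complete-pooling mean for $\mu$ is a simplifying assumption rather than the full Bayesian treatment. The helper name `shrinkage_estimates` is hypothetical.

```python
import numpy as np

# Eight schools: estimated effects y_j and known standard errors sigma_j
# (values as tabulated in BDA3, Table 5.2).
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

def shrinkage_estimates(y, sigma, mu, tau):
    """Conditional posterior mean of each theta_j for fixed mu and tau > 0."""
    prec_data = 1.0 / sigma**2           # precision of each group observation
    prec_prior = 1.0 / tau**2            # precision of the population distribution
    weight = prec_data / (prec_data + prec_prior)  # weight on y_j
    return weight * y + (1.0 - weight) * mu, weight

# Complete-pooling estimate: precision-weighted mean of the y_j.
mu_hat = np.sum(y / sigma**2) / np.sum(1.0 / sigma**2)

results = {}
for tau in (0.1, 5.0, 100.0):            # illustrative values, not estimates
    theta_hat, weight = shrinkage_estimates(y, sigma, mu_hat, tau)
    results[tau] = theta_hat
    print(f"tau={tau:6.1f}  weights={np.round(weight, 2)}  theta_hat={np.round(theta_hat, 1)}")
```

With a very small $\tau$ the estimates sit almost exactly on `mu_hat` (complete pooling); with a very large $\tau$ they return to the raw $y_j$ (no pooling).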
Structure of a Hierarchical Model
Definition: Three-Level Hierarchical Model
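$$
\begin{aligned}
y_{ij} \mid \theta_j &\sim p(y_{ij} \mid \theta_j) \\
\theta_j \mid \phi &\sim p(\theta_j \mid \phi) \\
\phi &\sim p(\phi)
\end{aligned}
$$

where $\phi$ are the hyperparameters governing the group-level distribution.
The posterior factorizes as:

$$p(\theta, \phi \mid y) = p(\phi \mid y)\, p(\theta \mid \phi, y) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$

Inference proceeds by first marginalizing over $\theta$ to obtain $p(\phi \mid y)$, then drawing $\theta$ from $p(\theta \mid \phi, y)$ given draws of $\phi$. Outside conjugate special cases this requires MCMC (see MCMC Basics).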
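For the normal model with known $\sigma_j$, the marginalize-then-draw scheme can be carried out without MCMC. The sketch below follows the standard conjugate results for that model: a grid approximation to $p(\tau \mid y)$, then $\mu \mid \tau, y$, then $\theta_j \mid \mu, \tau, y$. It assumes flat priors on $\mu$ and $\tau$, with the $\tau$ grid truncated at 40; the grid range, its resolution, and the number of draws are arbitrary choices made for illustration.

```python
import numpy as np

y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])       # eight schools effects
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])  # known standard errors

def log_marginal_tau(tau, y, sigma):
    """Unnormalized log p(tau | y), with theta and mu integrated out (flat priors)."""
    total_var = sigma**2 + tau**2
    v_mu = 1.0 / np.sum(1.0 / total_var)
    mu_hat = v_mu * np.sum(y / total_var)
    return (0.5 * np.log(v_mu)
            - 0.5 * np.sum(np.log(total_var))
            - 0.5 * np.sum((y - mu_hat) ** 2 / total_var))

rng = np.random.default_rng(0)

# Step 1: grid approximation to the marginal posterior of tau.
tau_grid = np.linspace(0.01, 40.0, 2000)        # assumed truncation of the flat prior
log_p = np.array([log_marginal_tau(t, y, sigma) for t in tau_grid])
p_tau = np.exp(log_p - log_p.max())
p_tau /= p_tau.sum()

n_draws = 4000
tau_draws = rng.choice(tau_grid, size=n_draws, p=p_tau)

# Step 2: draw mu | tau, y (normal, precision-weighted).
total_var = sigma**2 + tau_draws[:, None] ** 2  # shape (n_draws, J)
v_mu = 1.0 / np.sum(1.0 / total_var, axis=1)
mu_hat = v_mu * np.sum(y / total_var, axis=1)
mu_draws = rng.normal(mu_hat, np.sqrt(v_mu))

# Step 3: draw theta_j | mu, tau, y (normal, precision-weighted).
prec = 1.0 / sigma**2 + 1.0 / tau_draws[:, None] ** 2
theta_mean = (y / sigma**2 + mu_draws[:, None] / tau_draws[:, None] ** 2) / prec
theta_draws = rng.normal(theta_mean, np.sqrt(1.0 / prec))

print("posterior means of theta_j:", np.round(theta_draws.mean(axis=0), 1))
```

Each row of `theta_draws` is a joint posterior draw, so any summary (intervals, rankings, probabilities of one school exceeding another) can be computed directly from the array.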
Hyperprior Choice
The hyperprior on $\tau$ (the between-group SD) is critical:
- Flat prior $p(\tau) \propto 1$: can cause improper posteriors when $J$ is small
- Half-Cauchy($A$): recommended weakly informative prior (Gelman 2006); set $A$ to the expected scale of between-group variation
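- Half-$t_\nu$: generalizes the half-Cauchy, which is the $\nu = 1$ case; for $\nu > 1$ the tails are lighter than the half-Cauchy, so use a small $\nu$ when very large between-group variation is plausible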
- Inverse-Gamma (on $\tau^2$): historically popular but can underestimate $\tau$; avoid
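One way to see what "set the scale to the expected between-group variation" means for the half-Cauchy is through its closed-form tail probability. The sketch below uses only the standard half-Cauchy density and CDF. The scale value of 5 is a hypothetical choice for illustration, and the function names are not from any library.

```python
import math

def half_cauchy_log_density(tau, scale):
    """Log density of a half-Cauchy(0, scale) prior evaluated at tau >= 0."""
    if tau < 0:
        return -math.inf
    return math.log(2.0) - math.log(math.pi * scale) - math.log1p((tau / scale) ** 2)

def half_cauchy_tail_prob(x, scale):
    """P(tau > x) for tau ~ half-Cauchy(0, scale), from the arctan CDF."""
    return 1.0 - (2.0 / math.pi) * math.atan(x / scale)

scale = 5.0  # hypothetical prior scale, in the units of the group effects
for multiple in (1, 5, 20):
    prob = half_cauchy_tail_prob(multiple * scale, scale)
    print(f"P(tau > {multiple:2d} * scale) = {prob:.3f}")
```

The scale is the prior median of $\tau$: half of the prior mass lies below it, while the slowly decaying tail still leaves room for much larger values if the data demand them.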
Boundary Avoidance
When $J$ is small (only a few groups), the posterior for $\tau$ can concentrate near 0, collapsing to complete pooling. Use half-Cauchy or half-$t$ hyperpriors with a scale parameter informed by domain knowledge to reduce this risk.
Partial Pooling as Regularization
Partial pooling is equivalent to regularization of the group-level estimates. The hierarchical prior on $\theta_j$ acts as a penalty on how far group means deviate from $\mu$:
- Analogous to ridge regression (L2 penalty) applied to group effects
- The penalty strength is determined from the data via the posterior for $\tau$, unlike fixed ridge penalties
- In high dimensions, this adaptive regularization is crucial — see Bayesian Linear Regression for the regression analog
This connection makes hierarchical models the Bayesian answer to many problems framed as “multiple comparisons” or “multiple testing” in the frequentist literature. See Partial Pooling as Multiple Comparisons Correction for the formal algebra.
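The ridge analogy can be written out for the normal model with known $\sigma_j$, conditional on fixed $\mu$ and $\tau$ (a simplification: in the full model $\tau$ has its own posterior). Up to an additive constant, the negative log posterior for $\theta$ is a weighted least-squares term plus an L2 penalty with strength $\lambda = 1/\tau^2$:

$$-\log p(\theta \mid y, \mu, \tau) = \frac{1}{2} \sum_{j=1}^{J} \frac{(y_j - \theta_j)^2}{\sigma_j^2} + \frac{1}{2\tau^2} \sum_{j=1}^{J} (\theta_j - \mu)^2 + \text{const}$$

Setting the derivative with respect to $\theta_j$ to zero gives

$$-\frac{y_j - \hat\theta_j}{\sigma_j^2} + \frac{\hat\theta_j - \mu}{\tau^2} = 0 \quad\Longrightarrow\quad \hat\theta_j = \frac{\dfrac{1}{\sigma_j^2}\, y_j + \dfrac{1}{\tau^2}\, \mu}{\dfrac{1}{\sigma_j^2} + \dfrac{1}{\tau^2}}$$

so the penalized estimate coincides with the precision-weighted Bayes estimate. A small $\tau$ corresponds to a strong penalty (heavy shrinkage toward $\mu$), a large $\tau$ to a weak one.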
Key Practical Concepts
- Weakly informative hyperpriors: half-Cauchy or half-$t$ on $\tau$ avoid boundary issues
- Meta-analysis: hierarchical models are the natural framework for combining estimates across studies — the group-level model is the meta-analytic model
- Non-centered parameterization: for sampling efficiency, reparameterize $\theta_j = \mu + \tau \eta_j$, $\eta_j \sim \mathcal{N}(0, 1)$, which avoids funnel geometry in the posterior
- Varying intercepts and slopes: the regression extension (Hierarchical Linear Models) allows the intercept $\alpha$ and slope $\beta$ to vary by group, with partial pooling on each
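The non-centered parameterization mentioned in the list above can be checked with a short simulation. The sketch below only demonstrates that the two parameterizations define the same prior distribution for $\theta_j$; the values of `mu` and `tau` are hypothetical, and no sampler is run.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, tau = 8.0, 0.5          # hypothetical hyperparameter values
n_sims, J = 100_000, 8

# Centered: theta_j is drawn directly from N(mu, tau).
theta_centered = rng.normal(mu, tau, size=(n_sims, J))

# Non-centered: draw a standard normal eta_j, then shift and scale.
eta = rng.standard_normal(size=(n_sims, J))
theta_noncentered = mu + tau * eta

print("centered     mean, sd:", theta_centered.mean().round(3), theta_centered.std().round(3))
print("non-centered mean, sd:", theta_noncentered.mean().round(3), theta_noncentered.std().round(3))
```

In a sampler the difference is in which quantity is treated as the parameter: under the non-centered form the sampler explores $\eta_j$, whose prior scale does not depend on $\tau$, and $\theta_j$ is computed as a derived quantity.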
Connections to Causal Inference
Hierarchical models appear naturally in causal inference:
- Treatment effect heterogeneity: each unit’s effect as an exchangeable draw from a population distribution — connects to Local Average Treatment Effects (compliers form a group)
- Principal stratification (see Instrumental Variables and Principal Stratification): compliance strata are exchangeable latent groups
- Bayesian DiD: see Bayesian Difference in Differences for partial pooling over time periods
See Also
- Single-Parameter Models — building block for each group
- Bayesian Workflow - Overview — iterative building of hierarchical models
- Partial Pooling as Multiple Comparisons Correction — how partial pooling formally serves as a multiple comparisons correction (z-score shrinkage algebra)
- Multiple Comparisons - Bayesian Perspective — Gelman et al. (2009) on multilevel models replacing classical corrections
- Type S and Type M Errors — the error framework that motivates hierarchical modeling over classical corrections
- Local Average Treatment Effects — treatment effect heterogeneity in econometrics
- Differences-in-Differences — frequentist panel approach using similar exchangeability assumptions
- Instrumental Variables — complier heterogeneity parallels hierarchical variation across groups
- Bayesian Linear Regression — the single-level model that hierarchical models generalize
- Overfitting and Information Criteria — model comparison and regularization connect directly to partial pooling
- Garden of Forking Paths — hierarchical models address multiple comparisons that forking paths create
- Researcher Degrees of Freedom — partial pooling regularizes the researcher-flexibility problem structurally
- Linear Models in Statistical Rethinking — the single-level Gaussian model that hierarchical models extend (McElreath Ch. 4 → Ch. 12)
- Generalized Linear Models — hierarchical GLMs add group-level random effects to non-Gaussian likelihoods
- Efficient MCMC: HMC often needs the non-centered parameterization to sample hierarchical posteriors efficiently, especially when group-level data are weak
- Power Analysis and Sample Size — multilevel models improve effective power by pooling information across groups
- Empirical Bayes - Overview — empirical-Bayes estimation of the prior, the frequentist analogue of hierarchical pooling