Hierarchical Models

Summary

Chapter 5 of BDA3 introduces hierarchical (multilevel) models, the workhorse of applied Bayesian statistics. Parameters are modeled as exchangeable draws from a common population distribution, which enables partial pooling between groups. Partial pooling trades bias against variance: it shrinks noisy group estimates toward the population mean while leaving well-estimated group effects nearly unchanged.

The Core Idea: Exchangeability

Parameters $θ_{1}, \dots, θ_{J}$ are exchangeable if their joint distribution is invariant to permutations of the indices: we have no prior reason to treat any group differently. By de Finetti’s theorem, an (infinitely) exchangeable sequence can be written as a mixture of conditionally i.i.d. draws given hyperparameters. The theorem does not fix the form of the population distribution; the normal model is the standard choice:

$$θ_{j} ∣ μ, τ \sim N (μ, τ^{2}), \quad j = 1, \dots, J$$

This is the probabilistic justification for the hierarchical model structure: the prior follows from the judgment of exchangeability, rather than exchangeability being a consequence of a conveniently chosen prior.

Definition: Exchangeability

A sequence $θ_{1}, \dots, θ_{J}$ is exchangeable if for any permutation $π$ :
$p (θ_{1}, \dots, θ_{J}) = p (θ_{π (1)}, \dots, θ_{π (J)})$
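By de Finetti’s theorem, infinite exchangeability implies a hierarchical model with hyperparameter $ϕ$ ; for a finite number of groups the representation holds only approximately.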
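
Written out, the representation is a mixture over $ϕ$ of conditionally independent draws. The display below is a sketch of the standard statement of de Finetti's theorem (for infinite exchangeable sequences), using the notation of this note:

$$p (θ_{1}, \dots, θ_{J}) = \int \left[ \prod_{j=1}^{J} p (θ_{j} ∣ ϕ) \right] p (ϕ) \, dϕ$$

Integrating over $ϕ$ is what makes the $θ_{j}$ dependent marginally even though they are independent given $ϕ$ : learning about one group tells us something about $ϕ$ , and therefore about the others.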

The Eight Schools Example

The canonical example (Rubin 1981): estimating coaching effects from 8 independent educational experiments, each with its own estimated effect $y_{j}$ and known SE $σ_{j}$ .

Estimator	Approach	Result
No pooling	$\hat{θ}_{j} = y_{j}$	High variance; large SEs
Complete pooling	$\hat{θ}_{j} = \bar{y}$ (precision-weighted mean)	Biased if effects truly differ
Partial pooling	Hierarchical posterior	Data-adaptive compromise between the two

Conditional on the hyperparameters $μ$ and $τ$ , the Bayes estimate (posterior mean) of $θ_{j}$ under the normal hierarchical model is:

$$\hat{θ}_{j}^{Bayes} = E [θ_{j} ∣ μ, τ, y] = \frac{\frac{1}{σ_{j}^{2}} y_{j} + \frac{1}{τ^{2}} μ}{\frac{1}{σ_{j}^{2}} + \frac{1}{τ^{2}}}$$

This is a precision-weighted average of the group observation $y_{j}$ and the grand mean $μ$ . The weight on the group observation increases as $σ_{j}$ decreases (more data) or $τ$ increases (more between-group variation).
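
The precision weighting can be checked numerically. The following is a minimal sketch, not the book's code: the eight schools values are the standard ones from BDA3 Table 5.2 (they are not listed in this note), the function names `pooled_mean` and `shrink` are made up for illustration, and $μ$ and $τ$ are treated as fixed rather than estimated.

```python
# Eight schools data (Rubin 1981; BDA3 Table 5.2): estimated effects and known SEs.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def pooled_mean(y, sigma):
    """Complete-pooling estimate: precision-weighted mean of the y_j."""
    w = [1.0 / s**2 for s in sigma]
    return sum(wj * yj for wj, yj in zip(w, y)) / sum(w)


def shrink(y, sigma, mu, tau):
    """Conditional posterior means E[theta_j | mu, tau, y], for tau > 0."""
    out = []
    for yj, sj in zip(y, sigma):
        prec_data = 1.0 / sj**2   # precision of the group observation
        prec_prior = 1.0 / tau**2  # precision of the group-level model
        out.append((prec_data * yj + prec_prior * mu) / (prec_data + prec_prior))
    return out


mu = pooled_mean(y, sigma)  # plug-in value for mu, for illustration only
for tau in (0.1, 5.0, 100.0):  # illustrative values, not estimates
    print(tau, [round(t, 1) for t in shrink(y, sigma, mu, tau)])
```

Small `tau` pulls every estimate to `mu` (complete pooling) and large `tau` leaves the estimates at `y` (no pooling); intermediate values shrink the schools with large $σ_{j}$ the most.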

Key insight: the posterior for $τ$ is informed by how much the groups actually vary. If the $y_{j}$ differ by no more than their standard errors would explain, $\hat{τ} \approx 0$ and the estimates approach complete pooling. If they vary much more widely than that, $\hat{τ}$ is large and the estimates approach no pooling.
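
For the normal model with known $σ_{j}$ , this dependence can be made explicit. As a sketch, assuming a flat prior on $μ$ (an assumption, though it is the choice BDA3 makes in its eight schools analysis), integrating out the $θ_{j}$ and $μ$ gives the marginal posterior of $τ$ :

$$p (τ ∣ y) \propto p (τ) \, V_{μ}^{1/2} \prod_{j=1}^{J} (σ_{j}^{2} + τ^{2})^{-1/2} \exp \left( - \frac{(y_{j} - \hat{μ})^{2}}{2 (σ_{j}^{2} + τ^{2})} \right)$$

$$\hat{μ} = \frac{\sum_{j=1}^{J} \frac{y_{j}}{σ_{j}^{2} + τ^{2}}}{\sum_{j=1}^{J} \frac{1}{σ_{j}^{2} + τ^{2}}}, \qquad V_{μ}^{-1} = \sum_{j=1}^{J} \frac{1}{σ_{j}^{2} + τ^{2}}$$

The spread of the $y_{j}$ around $\hat{μ}$ enters through the exponential term: small deviations relative to $σ_{j}$ favor small $τ$ , large deviations favor large $τ$ .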

Structure of a Hierarchical Model

Definition: Three-Level Hierarchical Model

$y_{j} ∣ θ_{j} \sim p (y_{j} ∣ θ_{j})$ (data model)
$θ_{j} ∣ ϕ \sim p (θ_{j} ∣ ϕ)$ (group-level model / prior)
$ϕ \sim p (ϕ)$ (hyperprior)
where $ϕ = (μ, τ)$ are the hyperparameters governing the group-level distribution.

The posterior factorizes as:

$$p (θ_{1}, \dots, θ_{J}, ϕ ∣ y) \propto p (ϕ) \prod_{j=1}^{J} p (θ_{j} ∣ ϕ) \, p (y_{j} ∣ θ_{j})$$

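Inference proceeds in two steps: first integrate out the $θ_{j}$ to obtain the marginal posterior $p (ϕ ∣ y)$ , then draw each $θ_{j}$ from $p (θ_{j} ∣ ϕ, y)$ given a draw of $ϕ$ . Outside conjugate cases such as the normal model with known $σ_{j}$ , this generally requires MCMC (see MCMC Basics).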
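
For the normal model with known $σ_{j}$ , the marginal posterior of $τ$ is available up to a constant, so the steps can be sketched by direct simulation on a grid. The code below is an illustrative sketch under several assumptions that are not from this note: flat priors on $μ$ and $τ$ , a grid for $τ$ truncated at a hypothetical `tau_max = 30`, the eight schools data from BDA3 Table 5.2, and made-up function names.

```python
import math
import random

# Eight schools data (Rubin 1981; BDA3 Table 5.2).
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def mu_hat_and_var(tau):
    """Mean and variance of mu | tau, y under a flat prior on mu."""
    w = [1.0 / (s**2 + tau**2) for s in sigma]
    v_mu = 1.0 / sum(w)
    mu_hat = v_mu * sum(wj * yj for wj, yj in zip(w, y))
    return mu_hat, v_mu


def log_post_tau(tau):
    """Unnormalized log p(tau | y) with theta and mu integrated out (flat prior on tau)."""
    mu_hat, v_mu = mu_hat_and_var(tau)
    lp = 0.5 * math.log(v_mu)
    for yj, sj in zip(y, sigma):
        var = sj**2 + tau**2
        lp += -0.5 * math.log(var) - 0.5 * (yj - mu_hat) ** 2 / var
    return lp


def sample_posterior(n_draws, seed=0, tau_max=30.0, n_grid=300):
    """Draw (tau, mu, theta) by composition: tau | y, then mu | tau, y, then theta | mu, tau, y."""
    rng = random.Random(seed)
    # Midpoint grid, so every grid value of tau is strictly positive.
    grid = [(i + 0.5) * tau_max / n_grid for i in range(n_grid)]
    logp = [log_post_tau(t) for t in grid]
    top = max(logp)
    weights = [math.exp(lp - top) for lp in logp]  # subtract max for numerical stability
    draws = []
    for tau in rng.choices(grid, weights=weights, k=n_draws):
        mu_hat, v_mu = mu_hat_and_var(tau)
        mu = rng.gauss(mu_hat, math.sqrt(v_mu))
        theta = []
        for yj, sj in zip(y, sigma):
            v = 1.0 / (1.0 / sj**2 + 1.0 / tau**2)
            mean = v * (yj / sj**2 + mu / tau**2)  # precision-weighted average
            theta.append(rng.gauss(mean, math.sqrt(v)))
        draws.append((tau, mu, theta))
    return draws


draws = sample_posterior(1000)
print("posterior mean of tau:", sum(d[0] for d in draws) / len(draws))
```

The grid approach only works because $ϕ$ is low-dimensional and the inner integrals are analytic; with a non-normal data model or more hyperparameters, the same two-step logic is usually carried out by MCMC instead.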

Hyperprior Choice

The hyperprior on $τ$ (the between-group SD) is critical:

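Flat prior $p (τ) \propto 1$ : the posterior is improper when $J$ is very small ( $J < 3$ ; Gelman 2006)
Half-Cauchy( $0, s$ ): recommended weakly informative prior (Gelman 2006); $s$ set to the expected scale of between-group variation
Half- $t_{ν}$ : the family that contains the half-Cauchy as the special case $ν = 1$ ; larger $ν$ gives lighter tails and regularizes large values of $τ$ more strongly
Inverse-Gamma( $ε, ε$ ) on $τ^{2}$ : historically popular, but inferences are sensitive to the choice of $ε$ when $τ$ is near zero (Gelman 2006); best avoided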
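
To compare how much prior mass the half-Cauchy and half- $t$ candidates leave on large $τ$ , their densities can be evaluated directly. This is a small sketch using only the standard library; the function names and the scale `s = 5.0` are hypothetical choices for illustration.

```python
import math


def half_t_pdf(tau, nu, s):
    """Density of a half-t with nu degrees of freedom and scale s, for tau >= 0."""
    log_norm = (
        math.log(2.0)
        + math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(s)
    )
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p((tau / s) ** 2 / nu))


def half_cauchy_pdf(tau, s):
    """Half-Cauchy(0, s) density: 2 / (pi * s * (1 + (tau / s)^2))."""
    return 2.0 / (math.pi * s * (1.0 + (tau / s) ** 2))


s = 5.0  # hypothetical prior scale for the between-group SD
for tau in (0.0, 5.0, 25.0, 100.0):
    # nu = 1 reproduces the half-Cauchy; nu = 4 has a lighter tail.
    print(tau, half_cauchy_pdf(tau, s), half_t_pdf(tau, 1, s), half_t_pdf(tau, 4, s))
```

Both densities are finite and nonzero at $τ = 0$ , so neither rules out complete pooling; they differ mainly in how quickly they discount values of $τ$ far above the scale $s$ .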

Boundary Avoidance

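When $J$ is small (e.g. $J < 5$ groups), the data say little about $τ$ : point estimates can land exactly at 0, which collapses the model to complete pooling, and the posterior becomes sensitive to the hyperprior. A half-Cauchy or half- $t$ hyperprior whose scale is informed by domain knowledge helps keep the inference for $τ$ from being driven by this boundary behavior.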

Partial Pooling as Regularization

Partial pooling is equivalent to regularization of the group-level estimates. The hierarchical prior on $θ_{j}$ acts as a penalty on how far group means deviate from $μ$ :

Analogous to ridge regression (L2 penalty) applied to group effects
The penalty strength is determined from the data via the posterior for $τ$ , unlike fixed ridge penalties
In high dimensions, this adaptive regularization is crucial — see Bayesian Linear Regression for the regression analog
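
The ridge analogy can be written out for a single group. As a sketch, holding $μ$ and $τ$ fixed, the conditional posterior mode (and mean) of $θ_{j}$ solves a penalized least-squares problem:

$$\hat{θ}_{j} = \arg\min_{θ_{j}} \left[ \frac{(y_{j} - θ_{j})^{2}}{σ_{j}^{2}} + \frac{(θ_{j} - μ)^{2}}{τ^{2}} \right] = \frac{\frac{1}{σ_{j}^{2}} y_{j} + \frac{1}{τ^{2}} μ}{\frac{1}{σ_{j}^{2}} + \frac{1}{τ^{2}}}$$

Multiplying the objective by $σ_{j}^{2}$ gives $(y_{j} - θ_{j})^{2} + λ_{j} (θ_{j} - μ)^{2}$ with $λ_{j} = σ_{j}^{2} / τ^{2}$ , so the effective ridge penalty is large when $τ$ is small and differs by group according to $σ_{j}$ .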

This connection makes hierarchical models the Bayesian answer to many problems framed as “multiple comparisons” or “multiple testing” in the frequentist literature. See Partial Pooling as Multiple Comparisons Correction for the formal algebra.

Key Practical Concepts

Weakly informative hyperpriors: half-Cauchy or half- $t$ on $τ$ avoid boundary issues
Meta-analysis: hierarchical models are the natural framework for combining estimates across studies — the group-level model is the meta-analytic model
Non-centered parameterization: for sampling efficiency, reparameterize $θ_{j} = μ + τ η_{j}$ with $η_{j} \sim N (0, 1)$ ; this removes the funnel geometry that arises in the posterior when the data are only weakly informative about the $θ_{j}$
Varying intercepts and slopes: the regression extension (Hierarchical Linear Models) allows $μ_{j}$ and $β_{j}$ to vary by group, with partial pooling on each
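
The non-centered parameterization can be made concrete by writing the log joint density both ways. This is a minimal sketch with hypothetical function names; it omits the hyperprior (identical in both forms), uses the eight schools data from BDA3 Table 5.2, and evaluates at arbitrary illustrative parameter values.

```python
import math

# Eight schools data (Rubin 1981; BDA3 Table 5.2).
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def normal_logpdf(x, mean, sd):
    """Log density of N(mean, sd^2) at x."""
    return -math.log(sd) - 0.5 * math.log(2 * math.pi) - 0.5 * ((x - mean) / sd) ** 2


def log_joint_centered(theta, mu, tau, y, sigma):
    """log p(y | theta) + log p(theta | mu, tau); hyperprior omitted."""
    return sum(
        normal_logpdf(yj, tj, sj) + normal_logpdf(tj, mu, tau)
        for yj, tj, sj in zip(y, theta, sigma)
    )


def log_joint_noncentered(eta, mu, tau, y, sigma):
    """Same model in terms of eta_j ~ N(0, 1), with theta_j = mu + tau * eta_j."""
    return sum(
        normal_logpdf(yj, mu + tau * ej, sj) + normal_logpdf(ej, 0.0, 1.0)
        for yj, ej, sj in zip(y, eta, sigma)
    )


# Arbitrary illustrative values, not posterior estimates.
mu, tau = 8.0, 2.5
eta = [0.3, -1.2, 0.0, 0.7, -0.4, 1.5, -0.9, 0.2]
theta = [mu + tau * e for e in eta]

# The two log densities differ only by the Jacobian of the change of variables.
diff = log_joint_noncentered(eta, mu, tau, y, sigma) - log_joint_centered(theta, mu, tau, y, sigma)
print(diff, len(y) * math.log(tau))
```

In the non-centered form the prior on `eta` no longer depends on `tau`, which is why a sampler working in `(eta, mu, tau)` does not have to follow the narrowing neck of the funnel as `tau` approaches 0.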

Connections to Causal Inference

Hierarchical models appear naturally in causal inference:

Treatment effect heterogeneity: each unit’s effect $τ_{i}$ (a treatment effect here, not the between-group SD $τ$ used above) is modeled as an exchangeable draw from a population distribution; this connects to Local Average Treatment Effects (compliers form a group)
Principal stratification (see Instrumental Variables and Principal Stratification): compliance strata are exchangeable latent groups
Bayesian DiD: see Bayesian Difference in Differences for partial pooling over time periods
