Global-Local Shrinkage Priors

Summary

Global-local shrinkage priors write each regression coefficient as a zero-mean Gaussian whose scale is the product of a single global scale $τ$ (which shrinks everything toward zero) and a local scale $λ_{j}$ (which lets individual coefficients escape). They are the continuous, easy-to-sample alternative to discrete spike-and-slab priors. Every such prior yields the same expression for the shrinkage factor $κ_{j}$ ; what distinguishes ridge (all mass at a single value of $κ$ ), lasso (a single interior mode), and the horseshoe (U-shaped, with mass at both 0 and 1) is the implied prior density on $κ_{j}$ .

Overview

Two prior families dominate sparse Bayesian estimation: discrete two-component spike-and-slab priors (Spike-and-Slab Prior for Covariate Selection) and continuous shrinkage priors. The spike-and-slab is intuitive (with a delta spike it amounts to Bayesian model averaging), but its posterior is sensitive to the slab width and the prior inclusion probability, and inference over the $2^{D}$ model space is expensive (often needing EP or VI). Continuous shrinkage priors are easy to implement, can be sampled with generic tools (Stan), and can match spike-and-slab performance. This note covers the shared scaffolding; specific members are The Horseshoe Prior and the Regularized Horseshoe (Finnish Horseshoe).

Main Content

Global-local scale mixture

A global-local shrinkage prior on coefficients $β = (β_{1}, \dots, β_{D})$ of the regression $y_{i} = β^{T} x_{i} + ε_{i}$ , $ε_{i} \sim N (0, σ^{2})$ , is a scale mixture of Gaussians
$β_{j} ∣ λ_{j}, τ \sim N (0, τ^{2} λ_{j}^{2}), λ_{j} \sim π (λ_{j}), j = 1, \dots, D,$
where $τ$ is the global scale common to all coefficients and $λ_{j}$ is the local scale specific to $β_{j}$ . The choice of the local mixing density $π (λ_{j})$ defines the family member.
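As a minimal sketch of the scale mixture above, the following draws coefficients from the prior by first sampling the local scales and then the conditional Gaussians. The half-Cauchy choice for $π (λ_{j})$ and the function name `sample_global_local_prior` are assumptions made for illustration; any other mixing density could be substituted on the marked line.

```python
import numpy as np

def sample_global_local_prior(D, tau, rng, n_draws=1):
    """Draw beta from beta_j | lambda_j, tau ~ N(0, tau^2 * lambda_j^2).

    Assumption for illustration: lambda_j ~ C+(0, 1) (the horseshoe choice).
    Returns (beta, lam), each of shape (n_draws, D).
    """
    # Local scales: absolute value of a standard Cauchy is half-Cauchy(0, 1).
    lam = np.abs(rng.standard_cauchy(size=(n_draws, D)))
    # Conditional Gaussian with standard deviation tau * lambda_j.
    beta = rng.normal(loc=0.0, scale=tau * lam)
    return beta, lam

beta, lam = sample_global_local_prior(D=5, tau=0.1, rng=np.random.default_rng(0), n_draws=1000)
print(np.median(np.abs(beta)))  # typical coefficient magnitude under the prior
```

Most draws sit close to zero because $τ$ is small, while the heavy tail of $λ_{j}$ occasionally produces a large coefficient.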

The shrinkage factor $κ_{j}$

With uncorrelated predictors ( $X^{T} X \approx n \, \mathrm{diag} (s_{1}^{2}, \dots, s_{D}^{2})$ , $s_{j}^{2} = \mathrm{Var} (x_{j})$ ), the conditional posterior mean is $\bar{β}_{j} = (1 - κ_{j}) \hat{β}_{j}$ relative to the MLE $\hat{β} = (X^{T} X)^{-1} X^{T} y$ , where
$κ_{j} = \frac{1}{1 + n σ^{-2} τ^{2} s_{j}^{2} λ_{j}^{2}} \in [0, 1] .$
$κ_{j} = 1$ means complete shrinkage to zero, $κ_{j} = 0$ means the coefficient is left at its MLE. This expression holds for any scale-mixture-of-Gaussians prior, regardless of $π (λ_{j})$ — only the implied prior on $κ_{j}$ differs across priors.
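The identity $\bar{β}_{j} = (1 - κ_{j}) \hat{β}_{j}$ can be checked numerically. The sketch below builds a design with exactly orthogonal columns, so that $X^{T} X = n \, \mathrm{diag} (s_{j}^{2})$ holds exactly rather than approximately, and compares the shrinkage formula with the standard Gaussian conditional posterior mean. All numeric values (the local scales, predictor scales, and true coefficients) are hypothetical.

```python
import numpy as np

def shrinkage_factor(n, sigma, tau, s, lam):
    """kappa_j = 1 / (1 + n * sigma^-2 * tau^2 * s_j^2 * lambda_j^2)."""
    return 1.0 / (1.0 + n * tau**2 * s**2 * lam**2 / sigma**2)

rng = np.random.default_rng(0)
n, D = 200, 4
sigma, tau = 1.0, 0.5
lam = np.array([0.01, 0.5, 2.0, 20.0])  # hypothetical local scales
s = np.array([1.0, 1.0, 2.0, 0.5])      # hypothetical predictor scales

# Orthonormal columns Q (Q^T Q = I), rescaled so that X^T X = n * diag(s^2).
Q, _ = np.linalg.qr(rng.normal(size=(n, D)))
X = np.sqrt(n) * Q * s
y = X @ np.array([0.0, 0.3, 1.0, -2.0]) + sigma * rng.normal(size=n)

# MLE and the shrinkage-factor form of the conditional posterior mean.
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)
kappa = shrinkage_factor(n, sigma, tau, s, lam)
beta_bar = (1.0 - kappa) * beta_mle

# Direct conditional posterior mean for beta ~ N(0, tau^2 diag(lam^2)):
# (X^T X + sigma^2 * prior_precision)^-1 X^T y.
prior_precision = np.diag(1.0 / (tau**2 * lam**2))
beta_exact = np.linalg.solve(X.T @ X + sigma**2 * prior_precision, X.T @ y)

print(np.round(kappa, 4))
print(np.allclose(beta_bar, beta_exact))
```

The smallest local scale gives $κ_{j}$ close to 1 (the coefficient is pulled to zero) and the largest gives $κ_{j}$ close to 0 (the coefficient stays at its MLE).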

Implied prior on $κ_{j}$ (horseshoe)

For the half-Cauchy choice $λ_{j} \sim C^{+} (0, 1)$ , at fixed $τ, σ$ the shrinkage factor follows
$p (κ_{j} ∣ τ, σ) = \frac{1}{π} \frac{a_{j}}{(a_{j}^{2} - 1) κ_{j} + 1} \frac{1}{\sqrt{κ_{j}} \sqrt{1 - κ_{j}}}, \quad a_{j} = τ σ^{-1} \sqrt{n} \, s_{j} .$
When $a_{j} = 1$ this reduces to $Beta (\frac{1}{2}, \frac{1}{2})$ — the U-shaped “horseshoe” density with spikes at $κ = 0$ and $κ = 1$ .
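A quick numerical sanity check of this density, as a sketch: here $a_{j}$ is taken to be the constant for which $κ_{j} = 1 / (1 + a_{j}^{2} λ_{j}^{2})$ , and the last factor of the density is $1 / \sqrt{κ_{j} (1 - κ_{j})}$ . The function names are illustrative. The substitution $κ = \sin^{2} θ$ removes the endpoint singularities, so a plain midpoint rule is enough to confirm that the density integrates to 1 for several values of $a_{j}$ .

```python
import numpy as np

def kappa_density(kappa, a):
    """p(kappa | tau, sigma) under lambda ~ C+(0, 1), where kappa = 1 / (1 + a^2 lambda^2)."""
    return a / (np.pi * ((a**2 - 1.0) * kappa + 1.0) * np.sqrt(kappa * (1.0 - kappa)))

def total_mass(a, m=200_000):
    """Integrate the density over (0, 1) with kappa = sin(theta)^2 and a midpoint rule in theta."""
    h = (np.pi / 2) / m
    theta = (np.arange(m) + 0.5) * h
    kappa = np.sin(theta) ** 2
    jacobian = 2.0 * np.sin(theta) * np.cos(theta)  # d kappa / d theta
    return np.sum(kappa_density(kappa, a) * jacobian) * h

for a in (0.1, 1.0, 10.0):
    print(a, total_mass(a))

# At a = 1 the density is Beta(1/2, 1/2): 1 / (pi * sqrt(kappa * (1 - kappa))).
grid = np.array([0.01, 0.5, 0.99])
print(kappa_density(grid, 1.0))
```

At $a_{j} = 1$ the values at $κ = 0.01$ and $κ = 0.99$ coincide and both exceed the value at $κ = 0.5$ , which is the U-shape.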

Where the classics sit in $κ$ -space. The shape of $p (κ_{j})$ is the cleanest way to compare priors:

Ridge / Gaussian ( $λ_{j}$ fixed, i.e. a plain $N (0, τ^{2})$ ): all coefficients share one variance, so $κ_{j}$ concentrates at a single interior value — uniform shrinkage of every coefficient, no separation of signal from noise.
Lasso / Laplace (double-exponential, $λ_{j}^{2} \sim Exp$ ): a single interior mode for $κ_{j}$ ; shrinks moderately and cannot simultaneously leave strong signals unshrunk and crush noise.
Horseshoe (half-Cauchy $λ_{j}$ ): $Beta (\frac{1}{2}, \frac{1}{2})$ -like U-shape — mass at $κ = 0$ (relevant, no shrinkage, thanks to heavy Cauchy tails) and at $κ = 1$ (irrelevant, complete shrinkage). This bimodality is exactly the sparse behavior we want. See The Horseshoe Prior.
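The three shapes listed above can be compared by simulation. The sketch below draws $λ_{j}$ under each prior, maps it to $κ_{j} = 1 / (1 + a_{j}^{2} λ_{j}^{2})$ , and reports how much prior mass sits near each end of the unit interval. Several choices are assumptions for illustration only: $a_{j} = 1$ , $λ_{j} = 1$ for ridge, an exponential with mean 2 for the lasso's $λ_{j}^{2}$ (the note does not fix a rate), and the 0.1 and 0.9 thresholds.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
a = 1.0  # assumed value of a_j, fixed for illustration

def kappa_from_lambda(lam, a=a):
    """Shrinkage factor written in terms of a_j: kappa = 1 / (1 + a^2 lambda^2)."""
    return 1.0 / (1.0 + a**2 * lam**2)

lam_by_prior = {
    "ridge": np.ones(N),                                   # lambda_j fixed (assumed equal to 1)
    "lasso": np.sqrt(rng.exponential(scale=2.0, size=N)),  # lambda_j^2 ~ Exp, mean 2 (assumed)
    "horseshoe": np.abs(rng.standard_cauchy(size=N)),      # lambda_j ~ C+(0, 1)
}

summary = {}
for name, lam in lam_by_prior.items():
    kappa = kappa_from_lambda(lam)
    # Fraction of prior mass near "no shrinkage" and near "complete shrinkage".
    summary[name] = (np.mean(kappa < 0.1), np.mean(kappa > 0.9))
    print(f"{name:10s} P(kappa<0.1)={summary[name][0]:.3f}  P(kappa>0.9)={summary[name][1]:.3f}")
```

Under these assumptions the closed-form values are $1 - \frac{2}{π} \arctan 3 \approx 0.205$ in each tail for the horseshoe, against $e^{-4.5} \approx 0.011$ and $1 - e^{-1/18} \approx 0.054$ for the lasso, and exactly zero for ridge, whose $κ_{j}$ is fixed at 0.5.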

Changing $τ$ (equivalently $a_{j}$ ) tilts the U: small $τ$ (e.g. $a_{j} = 0.1$ ) pushes mass toward $κ = 1$ (more coefficients shrunk), large $τ$ pushes toward $κ = 0$ . Because for fixed $τ$ the sparsity also depends on dimension $D$ , one must reason about all $κ_{j}$ jointly — leading to $m_{eff}$ in Choosing the Global Scale and Effective Nonzeros.

Examples

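Why scale predictors: $a_{j} \propto s_{j}$ , so variables with larger scale $s_{j}$ are treated as more relevant a priori. Standardize to $s_{j}^{2} = 1$ unless the raw scales genuinely carry relevance information; alternatively absorb the scale into the local prior, $λ_{j} \sim C^{+} (0, s_{j}^{- 2})$ , reading the second argument as a squared scale (a half-Cauchy with scale $1 / s_{j}$ ), so that $s_{j} λ_{j}$ is standard half-Cauchy and the prior on $κ_{j}$ no longer depends on $s_{j}$ .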
Reading the U-shape: With $a_{j} = 0.1$ , $p (κ_{j})$ piles up near $κ_{j} = 1$ , so a priori most coefficients are expected to be shrunk to zero — the sparse regime.
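The tilt can also be quantified in closed form, as a short worked calculation under the half-Cauchy assumption $λ_{j} \sim C^{+} (0, 1)$ and writing $κ_{j} = 1 / (1 + a_{j}^{2} λ_{j}^{2})$ . The prior probability that a coefficient is shrunk by more than half is
$P (κ_{j} > \frac{1}{2}) = P (λ_{j} < 1 / a_{j}) = \frac{2}{π} \arctan (1 / a_{j}) .$
For $a_{j} = 0.1$ this is about 0.94, for $a_{j} = 1$ it is exactly 0.5, and for $a_{j} = 10$ it is about 0.06.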

Connections

The horseshoe and regularized horseshoe are the specific members studied in The Horseshoe Prior and Regularized Horseshoe (Finnish Horseshoe).
Summing $1 - κ_{j}$ over coefficients gives the effective model size in Choosing the Global Scale and Effective Nonzeros.
The discrete counterpart is Spike-and-Slab Prior for Covariate Selection.
Built on top of Bayesian Linear Regression; the global scale $τ$ acts as a hyperparameter analogous to those in Hierarchical Linear Models.
