Horseshoe and Regularized Horseshoe Priors

Summary

Piironen & Vehtari (2017) fix two long-standing problems with the horseshoe prior for sparse Bayesian regression: (1) there was no principled way to set the global shrinkage scale $τ$, and (2) the horseshoe leaves large coefficients completely unregularized, which is harmful when the likelihood only weakly identifies them (e.g. logistic regression with separable data). Their solutions are the effective number of nonzero coefficients $m_{eff}$, which turns a prior guess $p_{0}$ for the number of relevant predictors into a concrete prior for $τ$, and the regularized (Finnish) horseshoe, which adds a slab of scale $c$ (Gaussian for fixed $c$, Student-$t$ when $c^{2}$ is given an inverse-gamma prior) to softly cap the largest coefficients.

Overview

This is the hub note for an Obsidian cluster on global-local shrinkage priors. The paper extends Piironen & Vehtari (2017a) and targets regression and classification with many predictors, where only a few of the coefficients $β = (β_{1}, \dots, β_{D})$ are expected to be nonzero.

The four companion notes break the contribution into pieces:

Global-Local Shrinkage Priors: the scale-mixture-of-Gaussians framework, the shrinkage factor $κ_{j}$, and where ridge/lasso/horseshoe sit in $κ$-space.
The Horseshoe Prior: the Carvalho–Polson–Scott horseshoe with half-Cauchy local scales and the characteristic $\mathrm{Beta}(\frac{1}{2}, \frac{1}{2})$ “horseshoe” density on $κ_{j}$.
Choosing the Global Scale and Effective Nonzeros: $m_{eff}$ and the $τ_{0}$ prior-guess formula.
Regularized Horseshoe (Finnish Horseshoe): the slab scale $c$, the regularized local scale $\tilde{λ}_{j}$, and why it behaves like a continuous spike-and-slab.

The two main theoretical advances are summarized below; details live in the companion notes.

Main Content

The model is the standard linear Gaussian regression with a horseshoe prior on the coefficients.

Horseshoe prior for linear regression

For $y_{i} = β^{T} x_{i} + ε_{i}$, $ε_{i} \sim N(0, σ^{2})$, the horseshoe prior is the global-local scale mixture
$β_{j} \mid λ_{j}, τ \sim N(0, τ^{2} λ_{j}^{2}), \quad λ_{j} \sim C^{+}(0, 1), \quad j = 1, \dots, D,$
where $τ$ is the global scale (pulls all coefficients toward 0) and the half-Cauchy local scales $λ_{j}$ have heavy tails that let some $β_{j}$ escape the shrinkage. An intercept $β_{0}$ gets a relatively flat prior (no reason to shrink it).
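
To get a feel for what this prior implies, here is a minimal Python sketch that draws coefficients from it for a fixed $τ$. The function names and the numeric values are illustrative assumptions, not taken from the paper; the half-Cauchy draw uses the inverse-CDF transform $λ = \tan(π u / 2)$ with $u \sim U(0, 1)$.

```python
import math
import random


def sample_half_cauchy(rng):
    """Draw lambda ~ C+(0, 1) via the inverse CDF, F(x) = (2 / pi) * atan(x)."""
    return math.tan(0.5 * math.pi * rng.random())


def sample_horseshoe_prior(D, tau, rng):
    """Draw (betas, lambdas) with beta_j | lambda_j, tau ~ N(0, tau^2 * lambda_j^2)."""
    lambdas = [sample_half_cauchy(rng) for _ in range(D)]
    betas = [rng.gauss(0.0, tau * lam) for lam in lambdas]
    return betas, lambdas


# Illustrative values only: D = 1000 coefficients, global scale tau = 0.01.
rng = random.Random(0)
betas, lambdas = sample_horseshoe_prior(D=1000, tau=0.01, rng=rng)
share_tiny = sum(abs(b) < 0.01 for b in betas) / len(betas)  # fraction with |beta| < tau
largest = max(abs(b) for b in betas)                         # size of the biggest draw
```

In a typical run a large share of the draws sits within $τ$ of zero while the largest draw is far bigger than $τ$, which is the heavy-tailed escape behaviour described above.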

Shrinkage factor

Assuming uncorrelated predictors with $\mathrm{Var}(x_{j}) = s_{j}^{2}$ (so $X^{T} X \approx n \, \mathrm{diag}(s_{1}^{2}, \dots, s_{D}^{2})$), the posterior mean given the hyperparameters $(λ, τ, σ)$ satisfies $\bar{β}_{j} = (1 - κ_{j}) \hat{β}_{j}$, where $\hat{β}_{j}$ is the MLE and
$κ_{j} = \frac{1}{1 + n σ^{-2} τ^{2} s_{j}^{2} λ_{j}^{2}}$
is the shrinkage factor: $κ_{j} = 1$ is complete shrinkage to zero, $κ_{j} = 0$ is no shrinkage. As $τ \to 0$, $\bar{β} \to 0$; as $τ \to \infty$, $\bar{β} \to \hat{β}$.
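
The shrinkage factor is easy to evaluate directly. The following Python sketch (function names and numbers are hypothetical, chosen only for illustration) computes $κ_{j}$ and the resulting shrunken estimate for given hyperparameters.

```python
def shrinkage_factor(n, sigma, tau, s_j, lambda_j):
    """kappa_j = 1 / (1 + n * sigma^-2 * tau^2 * s_j^2 * lambda_j^2)."""
    return 1.0 / (1.0 + n * (tau * s_j * lambda_j / sigma) ** 2)


def shrunk_mean(beta_hat_j, kappa_j):
    """Conditional posterior mean: (1 - kappa_j) * beta_hat_j."""
    return (1.0 - kappa_j) * beta_hat_j


# Hypothetical setting: n = 100, sigma = 1, standardized predictor (s_j = 1).
kappa_small_scale = shrinkage_factor(100, 1.0, tau=0.01, s_j=1.0, lambda_j=0.1)    # near 1
kappa_large_scale = shrinkage_factor(100, 1.0, tau=0.01, s_j=1.0, lambda_j=100.0)  # near 0
```

A small local scale pushes $κ_{j}$ toward 1 (the coefficient is shrunk almost to zero), while a large local scale pushes it toward 0 (the MLE is left almost untouched).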

Regularized (Finnish) horseshoe

Replace the local scale by a slab-truncated version:
$β_{j} \mid λ_{j}, τ, c \sim N(0, τ^{2} \tilde{λ}_{j}^{2}), \quad \tilde{λ}_{j}^{2} = \frac{c^{2} λ_{j}^{2}}{c^{2} + τ^{2} λ_{j}^{2}}, \quad λ_{j} \sim C^{+}(0, 1).$
When $τ^{2} λ_{j}^{2} \ll c^{2}$ (small coefficient), $\tilde{λ}_{j}^{2} \to λ_{j}^{2}$ and we recover the original horseshoe; when $τ^{2} λ_{j}^{2} \gg c^{2}$ (large coefficient), $\tilde{λ}_{j}^{2} \to c^{2} / τ^{2}$, so the prior approaches $N(0, c^{2})$: a Gaussian slab of width $c$ that “soft-truncates” the heavy Cauchy tails. Letting $c \to \infty$ recovers the unregularized horseshoe.
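
The two limits can be checked numerically. This is a small Python sketch under assumed values of $τ$ and $c$ (both hypothetical, as are the function names); it evaluates the regularized local scale and the implied prior standard deviation $τ \tilde{λ}_{j}$ of $β_{j}$.

```python
import math


def regularized_lambda_sq(lambda_j, tau, c):
    """lambda_tilde_j^2 = c^2 * lambda_j^2 / (c^2 + tau^2 * lambda_j^2)."""
    lam_sq = lambda_j ** 2
    return (c ** 2 * lam_sq) / (c ** 2 + tau ** 2 * lam_sq)


def prior_sd(lambda_j, tau, c):
    """Prior standard deviation of beta_j: tau * lambda_tilde_j."""
    return tau * math.sqrt(regularized_lambda_sq(lambda_j, tau, c))


tau, c = 0.01, 2.0  # hypothetical global scale and slab width
sd_small = prior_sd(lambda_j=1.0, tau=tau, c=c)  # close to tau * lambda_j = 0.01
sd_large = prior_sd(lambda_j=1e6, tau=tau, c=c)  # close to the slab width c = 2.0
```

However large $λ_{j}$ becomes, the prior standard deviation stays below $c$, which is the soft cap the slab provides.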

Prior guess for the global scale

If $p_{0}$ is the prior guess for the number of relevant predictors out of $D$, set the global scale so that the prior mean of $m_{eff}$ equals $p_{0}$:
$τ_{0} = \frac{p_{0}}{D - p_{0}} \frac{σ}{\sqrt{n}}.$
$τ$ must scale as $σ / \sqrt{n}$ to keep prior beliefs about $m_{eff}$ consistent, which is exactly why the default $τ \sim C^{+}(0, 1)$ is a dubious choice: it ignores $σ$ and $n$ and puts far too much mass on large $τ$.
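
As a sketch of where this formula comes from, take the paper's definition $m_{eff} = \sum_{j=1}^{D} (1 - κ_{j})$ and assume standardized predictors ($s_{j} = 1$). Writing $a = τ σ^{-1} \sqrt{n}$, the shrinkage factor becomes $κ_{j} = 1 / (1 + a^{2} λ_{j}^{2})$, and with $λ_{j} \sim C^{+}(0, 1)$:
$E[κ_{j} \mid τ, σ] = \int_{0}^{\infty} \frac{1}{1 + a^{2} λ^{2}} \cdot \frac{2}{π (1 + λ^{2})} \, dλ = \frac{1}{1 + a},$
$E[m_{eff} \mid τ, σ] = \sum_{j=1}^{D} \left(1 - E[κ_{j} \mid τ, σ]\right) = \frac{a}{1 + a} D.$
Setting $E[m_{eff} \mid τ, σ] = p_{0}$ gives $a = p_{0} / (D - p_{0})$, that is, $τ_{0} = \frac{p_{0}}{D - p_{0}} \frac{σ}{\sqrt{n}}$.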

The paper also shows the regularized horseshoe is the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe corresponds to spike-and-slab with an infinitely wide slab. See Spike-and-Slab Prior for Covariate Selection.

Examples

Setting $τ_{0}$: With $D = 1000$ predictors, $n = 200$ observations, $σ \approx 1$, and a prior guess of $p_{0} = 5$ relevant variables: $τ_{0} = \frac{5}{995} \cdot \frac{1}{\sqrt{200}} \approx 3.6 \times 10^{-4}$. This is roughly three orders of magnitude below the scale 1 used by the naive $C^{+}(0, 1)$ default.
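Logistic regression / separation: When the data are separable, the likelihood keeps increasing as the separating coefficients grow, so the MLE diverges and the Cauchy-tailed horseshoe does nothing to stop the largest $β_{j}$ from drifting toward $\infty$; the posterior becomes very heavy-tailed and posterior means may not even be finite. A finite slab scale $c$ (e.g. $c^{2} \sim \text{Inv-Gamma}(ν/2, ν s^{2}/2)$, which gives a Student-$t_{ν}(0, s^{2})$ slab) caps this. For binary classification a workable plug-in for $σ^{2}$ in the $τ_{0}$ formula is the pseudo-variance $\tilde{σ}^{2} = 1 / (μ (1 - μ))$, e.g. $μ = 0.5 \Rightarrow \tilde{σ}^{2} = 4$.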
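
Both examples reduce to a couple of lines of arithmetic. The Python sketch below uses hypothetical helper names and assumes the $σ / \sqrt{n}$ scaling of $τ_{0}$; for the logistic case it plugs the pseudo standard deviation $\tilde{σ}$ in place of $σ$.

```python
import math


def tau0(p0, D, n, sigma=1.0):
    """Prior guess for the global scale: p0 / (D - p0) * sigma / sqrt(n)."""
    if not 0 < p0 < D:
        raise ValueError("p0 must satisfy 0 < p0 < D")
    return p0 / (D - p0) * sigma / math.sqrt(n)


def logistic_pseudo_sigma(mu):
    """Pseudo standard deviation for a binary outcome with mean mu: 1 / sqrt(mu * (1 - mu))."""
    return 1.0 / math.sqrt(mu * (1.0 - mu))


# Gaussian example from the text: D = 1000, n = 200, sigma = 1, p0 = 5.
tau0_gaussian = tau0(p0=5, D=1000, n=200, sigma=1.0)  # about 3.6e-4

# Logistic variant with mu = 0.5, so sigma_tilde = 2 (sigma_tilde^2 = 4).
sigma_tilde = logistic_pseudo_sigma(0.5)
tau0_logistic = tau0(p0=5, D=1000, n=200, sigma=sigma_tilde)
```

With $μ = 0.5$ the logistic plug-in simply doubles the Gaussian value of $τ_{0}$.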

Connections

Generalizes the horseshoe (The Horseshoe Prior) within the global-local family (Global-Local Shrinkage Priors).
The $τ$ problem and its $m_{eff}$ solution are in Choosing the Global Scale and Effective Nonzeros.
The slab regularization is in Regularized Horseshoe (Finnish Horseshoe).
Builds on Bayesian Linear Regression; competes with Spike-and-Slab Prior for Covariate Selection.
Connected to model-size control and the bias–variance view in Overfitting and Information Criteria.
