Regularized Horseshoe (Finnish Horseshoe)

Summary

The regularized horseshoe multiplies each local scale $\lambda_j$ by a soft-truncation factor that caps the prior scale $\tau\tilde\lambda_j$ at a slab scale $c$, via $\tilde\lambda_j^2 = c^2\lambda_j^2 / (c^2 + \tau^2\lambda_j^2)$. Small coefficients see the original horseshoe; large ones see a Gaussian slab $\mathcal{N}(0, c^2)$, so the heavy Cauchy tails are soft-truncated. This guarantees a minimum amount of regularization even for the strongest signals, curing the horseshoe’s failure under weak or flat likelihoods (e.g. separable logistic regression) while preserving its sparsity. It is the continuous counterpart of a spike-and-slab with a finite-width slab.

Overview

The original horseshoe (see The Horseshoe Prior) leaves large coefficients unregularized. This is usually praised, but it is harmful when the likelihood is weak: under separation in logistic regression the MLE diverges, and the Cauchy-tailed horseshoe lets $|\beta_j| \to \infty$, so posterior means may fail to exist (the same pathology as the Cauchy prior). The fix borrows the slab idea from Spike-and-Slab Prior for Covariate Selection: cap the prior variance of the big coefficients at $c^2$, where $c$ is a finite slab scale. This note gives the definition, the regularized shrinkage factor, why it fixes weak-likelihood problems, slab-prior choice, and the practical (Stan/rstanarm) parameterization.

Main Content

Regularized horseshoe prior

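$$\beta_j \mid \lambda_j, \tau, c \sim \mathcal{N}\!\left(0, \tau^2 \tilde\lambda_j^2\right), \qquad \tilde\lambda_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}, \qquad \lambda_j \sim \mathrm{C}^{+}(0, 1),$$

with slab scale $c > 0$. Limits: when $\tau^2\lambda_j^2 \ll c^2$ (small coefficient), $\tilde\lambda_j^2 \to \lambda_j^2$ and the prior is the original horseshoe; when $\tau^2\lambda_j^2 \gg c^2$ (large coefficient), $\tilde\lambda_j^2 \to c^2/\tau^2$ and the prior is $\mathcal{N}(0, c^2)$. As $c \to \infty$ the original horseshoe is recovered.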
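The two limits can be checked numerically. The following is a minimal sketch, assuming illustrative values for $\tau$ and $c$; the function name `regularized_local_scale` is a hypothetical helper, not part of any library.

```python
import math

def regularized_local_scale(lam, tau, c):
    """Return lambda_tilde = sqrt(c^2 lam^2 / (c^2 + tau^2 lam^2))."""
    return math.sqrt(c**2 * lam**2 / (c**2 + tau**2 * lam**2))

tau, c = 0.1, 2.0  # illustrative values only, not recommendations

# Small coefficient: tau*lam << c, so lambda_tilde stays close to lam
# (original horseshoe regime).
lam_small = 0.5
lt_small = regularized_local_scale(lam_small, tau, c)

# Large coefficient: tau*lam >> c, so the prior scale tau*lambda_tilde
# approaches the slab scale c (Gaussian slab regime).
lam_large = 1e6
prior_sd_large = tau * regularized_local_scale(lam_large, tau, c)

print(lt_small, prior_sd_large)
```

Passing a very large `c` to the same function returns `lam` essentially unchanged, which corresponds to recovering the original horseshoe.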

Product-of-factors interpretation

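The conditional prior factorizes as

$$p(\beta_j \mid \lambda_j, \tau, c) \propto \mathcal{N}\!\left(\beta_j \mid 0, \tau^2\lambda_j^2\right)\, \mathcal{N}\!\left(\beta_j \mid 0, c^2\right).$$

The prior behaves (roughly) as the narrower of the two factors: the horseshoe shrinks small signals, while the Gaussian slab “soft-truncates” the extreme horseshoe tails, controlling the magnitude of the largest $|\beta_j|$.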
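A short derivation may help connect the product form to the definition of $\tilde\lambda_j$: multiplying two zero-mean Gaussian densities adds their precisions, and the resulting variance is exactly $\tau^2\tilde\lambda_j^2$.

```latex
\mathcal{N}(\beta_j \mid 0, \tau^2\lambda_j^2)\,\mathcal{N}(\beta_j \mid 0, c^2)
  \;\propto\; \exp\!\left\{-\frac{\beta_j^2}{2}\left(\frac{1}{\tau^2\lambda_j^2} + \frac{1}{c^2}\right)\right\},
\qquad
\frac{1}{\tau^2\lambda_j^2} + \frac{1}{c^2}
  = \frac{c^2 + \tau^2\lambda_j^2}{c^2\,\tau^2\lambda_j^2}
  = \frac{1}{\tau^2\tilde\lambda_j^2}.
```

The combined precision is always at least $1/c^2$, which is why the prior variance can never exceed $c^2$.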

Regularized shrinkage factor and effective model size

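The regularized horseshoe shifts the left mode of the density of the shrinkage factor from $0$ to $b_j$ (the horseshoe on $(0, 1)$ becomes one on $(b_j, 1)$). The shrinkage factor satisfies, to a close approximation, $\tilde\kappa_j = (1 - b_j)\kappa_j + b_j$, so $\tilde\kappa_j \in (b_j, 1)$. With $b_j = 1/(1 + n\sigma^{-2} s_j^2 c^2)$ and $s_j = 1$ (so that $b_j = b$ for all $j$),

$$\tilde m_{\text{eff}} = \sum_{j=1}^{D} (1 - \tilde\kappa_j) \approx (1 - b)\, m_{\text{eff}},$$

where $m_{\text{eff}} = \sum_{j=1}^{D} (1 - \kappa_j)$ is the original horseshoe’s effective nonzeros. Effective complexity is thus always smaller than the pure horseshoe’s, because even far-from-zero coefficients are still touched by the slab. The formula $\tau_0 = \frac{p_0}{D - p_0}\frac{\sigma}{\sqrt{n}}$ from Choosing the Global Scale and Effective Nonzeros still applies, with $p_0$ read as the prior guess for the number of coefficients far from zero.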
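The relationship between the two effective model sizes can be illustrated numerically. This is a sketch under stated assumptions: the shrinkage factor is taken as $\kappa_j = 1/(1 + n\sigma^{-2}\tau^2 s_j^2\lambda_j^2)$, and the values of `n`, `sigma`, `c`, `p0`, `D` and the grid of local scales are hypothetical.

```python
import math

def kappa(lam, tau, n, sigma, s=1.0):
    """Original horseshoe shrinkage factor 1 / (1 + n sigma^-2 tau^2 s^2 lam^2)."""
    return 1.0 / (1.0 + n * tau**2 * s**2 * lam**2 / sigma**2)

def kappa_reg(lam, tau, c, n, sigma, s=1.0):
    """Regularized shrinkage factor: same formula with lam replaced by lambda_tilde."""
    lam_t2 = c**2 * lam**2 / (c**2 + tau**2 * lam**2)
    return 1.0 / (1.0 + n * tau**2 * s**2 * lam_t2 / sigma**2)

n, sigma, c = 100, 1.0, 2.0   # hypothetical sample size, noise sd, slab scale
p0, D = 5, 200                # hypothetical prior guess and number of covariates
tau0 = p0 / (D - p0) * sigma / math.sqrt(n)

# Lower bound on the regularized shrinkage factor (with s_j = 1).
b = 1.0 / (1.0 + n * c**2 / sigma**2)

# A fixed grid of local scales, purely for illustration.
lams = [0.01, 0.1, 1.0, 10.0, 100.0, 1e4]

m_eff = sum(1.0 - kappa(l, tau0, n, sigma) for l in lams)
m_eff_reg = sum(1.0 - kappa_reg(l, tau0, c, n, sigma) for l in lams)

print(m_eff, m_eff_reg, (1.0 - b) * m_eff)
```

On this grid the regularized value is smaller than the original one and agrees closely with $(1 - b)\,m_{\text{eff}}$; every regularized shrinkage factor stays above `b`.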

Slab prior and exact-horseshoe variant

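Rather than fixing $c$, place a prior on $c^2$:

$$c^2 \sim \text{Inv-Gamma}(\alpha, \beta), \qquad \alpha = \nu/2, \quad \beta = \nu s^2/2,$$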

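which makes the slab a Student-$t_\nu(0, s^2)$ for the far-from-zero coefficients. This is a good weakly informative default: the light left tail of the inverse-gamma keeps $c$ away from zero and so avoids over-shrinking already-large coefficients. To retain the exact horseshoe shape for the shrinkage factor (rather than the close approximation above) one can instead use $\tilde\lambda_j^2 = c^2\lambda_j^2 / \left(c^2 + \tau^2\lambda_j^2 + \sigma^2/(n s_j^2)\right)$; the simpler form is preferred in practice since $\sigma^2/(n s_j^2)$ is usually negligible vs. $c^2$.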
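The Student-$t$ slab can be checked by simulation. The sketch below assumes the inverse-gamma parameterization $\alpha = \nu/2$, $\beta = \nu s^2/2$ and uses illustrative values $\nu = 4$, $s = 2$; `sample_slab_coefficient` is a hypothetical helper name.

```python
import random

def sample_slab_coefficient(nu, s, rng):
    """Draw c^2 ~ Inv-Gamma(nu/2, nu*s^2/2), then beta | c ~ N(0, c^2)."""
    # If G ~ Gamma(shape=a, rate=b) then 1/G ~ Inv-Gamma(a, b).
    # random.gammavariate takes (shape, scale), so scale = 1/rate.
    g = rng.gammavariate(nu / 2.0, 2.0 / (nu * s**2))
    c2 = 1.0 / g
    return rng.gauss(0.0, c2 ** 0.5)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
nu, s = 4.0, 2.0         # illustrative slab degrees of freedom and scale
draws = [sample_slab_coefficient(nu, s, rng) for _ in range(20000)]

# A Student-t with 4 degrees of freedom has about 95% of its mass within
# 2.776 scale units of zero, so roughly 95% of draws should fall in
# the interval |beta| < 2.776 * s.
frac_inside = sum(abs(x) < 2.776 * s for x in draws) / len(draws)
print(frac_inside)
```

The empirical coverage should sit near 0.95, which is what the scale-mixture argument predicts for a Student-$t_4(0, 2^2)$ marginal.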

Why it fixes weak-likelihood / separation problems. Capping the prior variance at $c^2$ prevents the largest coefficients from running to infinity when the likelihood is flat, so posterior means stay finite and well-behaved. This is exactly the regime where the original horseshoe (and the Cauchy prior) fail. It also supersedes the earlier “hierarchical shrinkage” idea (raising the local degrees of freedom $\nu$ above 1), which reduces sparsity and is no longer recommended.

Spike-and-slab connection. Because the slab has a finite width $c$, the regularized horseshoe is the continuous counterpart of a spike-and-slab with a finite slab; the original horseshoe corresponds to an infinitely wide slab ($c \to \infty$). See Spike-and-Slab Prior for Covariate Selection.

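Non-Gaussian models. For GLMs replace $\sigma^2$ with a pseudo-variance $\tilde\sigma^2$ from a Gaussian (Laplace) approximation to the likelihood. Per observation, $\tilde\sigma_i^2 = -1/L_i''$, where $L_i''$ is the second derivative of the log-likelihood of observation $i$ with respect to its linear predictor; a crude single-value plug-in uses the variance function, e.g. for binomial-logit $\tilde\sigma^2 = \mu^{-1}(1 - \mu)^{-1}$, so $\tilde\sigma = 2$ for balanced binary classification.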
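As a worked instance of the binomial-logit plug-in, the sketch below computes the pseudo-sigma from the observed class proportion and feeds it into the global-scale formula. The outcome vector and the values of `p0` and `D` are hypothetical.

```python
import math

def pseudo_variance_logit(mu):
    """Plug-in pseudo-variance 1 / (mu * (1 - mu)) for a binomial-logit model."""
    return 1.0 / (mu * (1.0 - mu))

y = [0, 1, 1, 0, 1, 0, 0, 1]   # hypothetical balanced binary outcomes
n = len(y)
mu = sum(y) / n                # observed class proportion (0.5 here)

pseudo_sigma = math.sqrt(pseudo_variance_logit(mu))

# Global scale with the pseudo-sigma in place of sigma.
p0, D = 2, 20                  # hypothetical prior guess and number of covariates
tau0 = p0 / (D - p0) * pseudo_sigma / math.sqrt(n)

print(pseudo_sigma, tau0)
```

For unbalanced classes `mu` moves away from 0.5 and the pseudo-sigma grows, which in turn widens the prior guess for the global scale.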

Examples

  • Default slab: $\nu = 4$, $s = 2$ (i.e. $c^2 \sim \text{Inv-Gamma}(2, 8)$) gives a Student-$t_4(0, 2^2)$ slab, so coefficients far from zero are softly capped around a few units (on the standardized scale).
  • Stan parameterization (App. C): use the non-centered form beta = z .* lambda_tilde * tau with z ~ normal(0,1), lambda ~ student_t(nu_local,0,1) (nu_local = 1 → half-Cauchy), tau ~ student_t(nu_global,0,scale_global*sigma), caux ~ inv_gamma(0.5*slab_df, 0.5*slab_df), c = slab_scale*sqrt(caux), and lambda_tilde = sqrt(c^2*lambda^2 ./ (c^2 + tau^2*lambda^2)). Set scale_global $= p_0 / ((D - p_0)\sqrt{n})$; the factor $\sigma$ enters through the prior on tau. A heavier non-centered variant (aux1/aux2 decomposition, Peltola et al. 2014) helps reduce divergences from the funnel-shaped posterior.
  • rstanarm: tau0 <- p0/(D-p0)/sqrt(n) then prior_coeff <- hs(df=1, global_df=1, global_scale=tau0) and fit with stan_glm(..., prior = prior_coeff) (rstanarm scales by $\sigma$ automatically). For logistic regression, use sigma <- 1/sqrt(mean(y)*(1-mean(y))) as the pseudo-sigma in tau0, i.e. tau0 <- p0/(D-p0) * sigma/sqrt(n).
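To make the non-centered parameterization concrete, here is a minimal Python sketch that draws once from the prior using the same variable names as the Stan description above. It is only a prior simulation under stated assumptions, not a replacement for the Stan model: the helper names `half_student_t` and `draw_regularized_horseshoe` and the values of `p0`, `D`, `n`, the slab settings and the seed are hypothetical.

```python
import math
import random

def half_student_t(nu, scale, rng):
    """Draw |T| * scale with T ~ Student-t(nu); nu = 1 gives a half-Cauchy."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    return abs(z / math.sqrt(chi2 / nu)) * scale

def draw_regularized_horseshoe(D, scale_global, sigma, slab_scale, slab_df, rng,
                               nu_local=1.0, nu_global=1.0):
    """One prior draw of (beta, prior_sd, c) following the non-centered form."""
    tau = half_student_t(nu_global, scale_global * sigma, rng)
    # caux ~ inv_gamma(a, a) with a = slab_df / 2, drawn as 1 / Gamma(shape=a, scale=1/a).
    caux = 1.0 / rng.gammavariate(0.5 * slab_df, 1.0 / (0.5 * slab_df))
    c = slab_scale * math.sqrt(caux)
    beta, prior_sd = [], []
    for _ in range(D):
        lam = half_student_t(nu_local, 1.0, rng)
        z = rng.gauss(0.0, 1.0)
        lam_tilde = math.sqrt(c**2 * lam**2 / (c**2 + tau**2 * lam**2))
        prior_sd.append(tau * lam_tilde)   # conditional prior sd of this coefficient
        beta.append(z * lam_tilde * tau)   # non-centered: beta = z * lambda_tilde * tau
    return beta, prior_sd, c

p0, D, n = 5, 100, 50                      # hypothetical problem dimensions
scale_global = p0 / (D - p0) / math.sqrt(n)

rng = random.Random(1)                     # fixed seed for reproducibility
beta, prior_sd, c = draw_regularized_horseshoe(D, scale_global, 1.0, 2.0, 4.0, rng)

print(scale_global, c, max(prior_sd))
```

In every draw the conditional prior standard deviation `tau * lam_tilde` stays below `c`, which is the soft cap the slab is meant to provide.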

Connections

See Also