Regularized Horseshoe (Finnish Horseshoe)
Summary
The regularized horseshoe replaces each local scale $\lambda_j$ with a softly truncated version $\tilde\lambda_j$, defined by $\tilde\lambda_j^2 = c^2\lambda_j^2 / (c^2 + \tau^2\lambda_j^2)$, so that the prior scale $\tau\tilde\lambda_j$ is capped at a slab scale $c$. Small coefficients see the original horseshoe; large ones see a Gaussian slab $\mathcal{N}(0, c^2)$, so the heavy Cauchy tails are soft-truncated. This guarantees a minimum amount of regularization even for the strongest signals, curing the horseshoe’s failure under weak or flat likelihoods (e.g. separable logistic regression) while preserving its sparsity. It is the continuous counterpart of a spike-and-slab with a finite-width slab.
Overview
The original horseshoe (The Horseshoe Prior) leaves large coefficients unregularized. This is usually praised, but it is harmful when the likelihood is weak: under separation in logistic regression the MLE diverges, and the Cauchy-tailed horseshoe lets $|\beta_j| \to \infty$, so posterior means may fail to exist (the same pathology as the Cauchy prior). The fix borrows the slab idea from Spike-and-Slab Prior for Covariate Selection: cap the prior variance of the big coefficients at a finite slab scale $c^2$. This note gives the definition, the regularized shrinkage factor, why it fixes weak-likelihood problems, slab-prior choice, and the practical (Stan/rstanarm) parameterization.
Main Content
Regularized horseshoe prior
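$$\beta_j \mid \lambda_j, \tau, c \sim \mathcal{N}\!\left(0,\ \tau^2 \tilde\lambda_j^2\right), \qquad \tilde\lambda_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}, \qquad \lambda_j \sim \mathrm{C}^+(0, 1),$$
with slab scale $c > 0$. Limits: when $\tau^2\lambda_j^2 \ll c^2$ (small coefficient), $\tilde\lambda_j^2 \to \lambda_j^2$ and the prior approaches the original horseshoe; when $\tau^2\lambda_j^2 \gg c^2$ (large coefficient), $\tilde\lambda_j^2 \to c^2/\tau^2$ and the prior approaches $\mathcal{N}(0, c^2)$. As $c \to \infty$ the original horseshoe is recovered.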
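The two limits can be checked numerically. The following is a minimal sketch; the function name and the values of `tau`, `c`, and the local scales are illustrative assumptions, not taken from the source.

```python
import math

def regularized_local_scale(lam, tau, c):
    """Return lambda_tilde = sqrt(c^2 lam^2 / (c^2 + tau^2 lam^2))."""
    return math.sqrt(c**2 * lam**2 / (c**2 + tau**2 * lam**2))

tau, c = 0.1, 2.0  # illustrative values only

# Small coefficient: tau * lam = 0.05 << c, so lambda_tilde stays close to lam.
small = regularized_local_scale(0.5, tau, c)

# Large coefficient: tau * lam = 1000 >> c, so tau * lambda_tilde approaches c.
large = regularized_local_scale(1e4, tau, c)

print(small)        # close to 0.5
print(tau * large)  # close to 2.0
```

In both regimes the prior standard deviation $\tau\tilde\lambda_j$ never exceeds $c$, which is the soft cap described above.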
Product-of-factors interpretation
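The conditional prior factorizes as
$$p(\beta_j \mid \lambda_j, \tau, c) \propto \mathcal{N}\!\left(\beta_j \mid 0,\ \tau^2\lambda_j^2\right)\, \mathcal{N}\!\left(\beta_j \mid 0,\ c^2\right).$$
The prior behaves (roughly) as the narrower of the two factors: the horseshoe shrinks small signals, while the Gaussian slab “soft-truncates” the extreme horseshoe tails, controlling the magnitude of the largest $\beta_j$.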
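As a short worked step (using only the symbols already defined), multiplying the two Gaussian kernels adds their precisions, which recovers the regularized variance $\tau^2\tilde\lambda_j^2$:

```latex
\mathcal{N}\!\left(\beta_j \mid 0, \tau^2\lambda_j^2\right)
\mathcal{N}\!\left(\beta_j \mid 0, c^2\right)
\;\propto\;
\exp\!\left[-\frac{\beta_j^2}{2}\left(\frac{1}{\tau^2\lambda_j^2} + \frac{1}{c^2}\right)\right],
\qquad
\left(\frac{1}{\tau^2\lambda_j^2} + \frac{1}{c^2}\right)^{-1}
= \frac{c^2\,\tau^2\lambda_j^2}{c^2 + \tau^2\lambda_j^2}
= \tau^2\tilde\lambda_j^2 .
```

Because precisions add, the combined variance is always smaller than both $\tau^2\lambda_j^2$ and $c^2$, which is why the prior tracks the narrower factor.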
Regularized shrinkage factor and effective model size
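Let $\kappa_j = 1/(1 + n\sigma^{-2}\tau^2 s_j^2 \lambda_j^2)$ be the original horseshoe’s shrinkage factor, where $n$ is the number of observations, $\sigma^2$ the noise variance, and $s_j^2$ the variance of the $j$-th input. The regularized horseshoe shifts the left mode of the density of the shrinkage factor from $0$ to $b_j = 1/(1 + n\sigma^{-2} s_j^2 c^2)$ (the horseshoe on $[0, 1]$ becomes one on $[b_j, 1]$). The regularized shrinkage factor satisfies, to a close approximation, $\tilde\kappa_j = (1 - b_j)\,\kappa_j + b_j$, so $\tilde\kappa_j \in [b_j, 1]$. With $s_j = s$ and $b_j = b$ for all $j$,
$$\tilde m_{\text{eff}} = (1 - b)\, m_{\text{eff}},$$
where $m_{\text{eff}}$ is the original horseshoe’s effective number of nonzeros. Effective complexity is thus always smaller than the pure horseshoe’s, because even far-from-zero coefficients are still touched by the slab. The formula $\tau_0 = \frac{p_0}{D - p_0}\,\frac{\sigma}{\sqrt{n}}$ from Choosing the Global Scale and Effective Nonzeros still applies, with $p_0$ read as the prior guess for the number of coefficients far from zero.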
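The relation between the original and regularized shrinkage factors can be compared numerically. This is a sketch under assumed values: `n`, `sigma`, `tau`, `c`, and the grid of local scales are illustrative, and `shrinkage_factor` is a hypothetical helper name.

```python
def shrinkage_factor(prior_var, n, sigma, s_j=1.0):
    """kappa = 1 / (1 + n * sigma^-2 * s_j^2 * prior_var)."""
    return 1.0 / (1.0 + n * s_j**2 * prior_var / sigma**2)

n, sigma, tau, c = 100, 1.0, 0.05, 2.0  # illustrative values only

# Lower bound on the regularized shrinkage factor (prior variance capped at c^2).
b = shrinkage_factor(c**2, n, sigma)

rows = []
for lam in (0.1, 1.0, 10.0, 1000.0):
    kappa = shrinkage_factor(tau**2 * lam**2, n, sigma)
    lam_tilde_sq = c**2 * lam**2 / (c**2 + tau**2 * lam**2)
    kappa_tilde = shrinkage_factor(tau**2 * lam_tilde_sq, n, sigma)
    rows.append((lam, kappa, kappa_tilde, (1 - b) * kappa + b))

for lam, kappa, kappa_tilde, approx in rows:
    print(f"lambda={lam:8.1f}  kappa={kappa:.4f}  "
          f"kappa_tilde={kappa_tilde:.4f}  (1-b)*kappa+b={approx:.4f}")
```

For every local scale the regularized factor stays above `b` and close to `(1 - b) * kappa + b`, so even the largest coefficients keep a small but nonzero amount of shrinkage.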
Slab prior and exact-horseshoe variant
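Rather than fixing $c$, place a prior on $c^2$:
$$c^2 \sim \text{Inv-Gamma}(\alpha, \beta), \qquad \alpha = \nu/2, \quad \beta = \nu s^2/2,$$
which makes the slab a Student-$t_\nu(0, s^2)$ for the far-from-zero coefficients. This is a good weakly-informative default: the inverse-gamma’s light left tail keeps $c$ away from zero and so avoids over-shrinking already-large coefficients. To retain the exact horseshoe shape for the shrinkage factor (rather than the close approximation above) one can instead use $\tilde\lambda_j^2 = c^2\lambda_j^2 / \left(c^2 + \tau^2\lambda_j^2 + \sigma^2/(n s_j^2)\right)$; the simpler form is preferred in practice since $\sigma^2/(n s_j^2)$ is usually negligible vs. $c^2$.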
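A minimal sketch of the slab hyperprior, mapping $(\nu, s)$ to the inverse-gamma parameters and drawing $c$ with the Python standard library. The helper names and the values $\nu = 4$, $s = 2$ are assumptions for illustration.

```python
import random

def slab_inv_gamma_params(nu, s):
    """Map slab degrees of freedom nu and scale s to (alpha, beta)."""
    return nu / 2.0, nu * s**2 / 2.0

def sample_slab_scale(nu, s, rng):
    """Draw c with c^2 ~ Inv-Gamma(alpha, beta)."""
    alpha, beta = slab_inv_gamma_params(nu, s)
    # If G ~ Gamma(shape=alpha, rate=beta), then 1/G ~ Inv-Gamma(alpha, beta).
    # random.gammavariate takes (shape, scale), so the scale is 1 / beta.
    g = rng.gammavariate(alpha, 1.0 / beta)
    return (1.0 / g) ** 0.5

rng = random.Random(0)
params = slab_inv_gamma_params(4, 2.0)
draws = [sample_slab_scale(4, 2.0, rng) for _ in range(5)]
print(params)
print(draws)
```

Drawing $\beta_j \sim \mathcal{N}(0, c^2)$ with $c$ sampled this way gives the Student-$t_\nu(0, s^2)$ slab marginally.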
Why it fixes weak-likelihood / separation problems. Capping the prior variance at $c^2$ prevents the largest coefficients from running to infinity when the likelihood is flat, so posterior means stay finite and well-behaved. This is exactly the regime where the original horseshoe (and the Cauchy prior) fail. It also supersedes the earlier “hierarchical shrinkage” idea (raising the degrees of freedom $\nu$ of the local half-$t$ prior above 1), which reduces sparsity and is no longer recommended.
Spike-and-slab connection. Because the slab gives a finite width $c$, the regularized horseshoe is the continuous counterpart of a spike-and-slab with a finite slab; the original horseshoe corresponds to an infinitely wide slab. See Spike-and-Slab Prior for Covariate Selection.
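Non-Gaussian models. For GLMs replace $\sigma^2$ with a pseudo-variance $\tilde\sigma^2$ from a Gaussian (Laplace) approximation to the likelihood. Per observation, $\tilde\sigma_i^2 = -1/L_i''$, where $L_i$ is the log-likelihood of observation $i$ as a function of its linear predictor and the second derivative is evaluated at the point of approximation. A crude single-value plug-in uses the variance function, e.g. binomial-logit $\tilde\sigma^2 = \mu^{-1}(1 - \mu)^{-1}$, so $\tilde\sigma = 2$ for balanced binary classification ($\mu = 0.5$).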
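The plug-in pseudo-sigma and the resulting global scale can be computed as follows. This is a sketch only: the function names and the values of `p0`, `D`, and `n` are hypothetical, and the global-scale formula is $\tau_0 = \frac{p_0}{D - p_0}\frac{\sigma}{\sqrt{n}}$ with $\sigma$ replaced by $\tilde\sigma$.

```python
import math

def logit_pseudo_sigma(mu):
    """Plug-in pseudo-sigma for a binomial-logit model: 1 / sqrt(mu * (1 - mu))."""
    return 1.0 / math.sqrt(mu * (1.0 - mu))

def global_scale(p0, D, n, sigma):
    """tau0 = p0 / (D - p0) * sigma / sqrt(n)."""
    return p0 / (D - p0) * sigma / math.sqrt(n)

sigma_tilde = logit_pseudo_sigma(0.5)  # balanced classes
tau0 = global_scale(p0=5, D=100, n=400, sigma=sigma_tilde)

print(sigma_tilde)  # 2.0
print(tau0)
```

For unbalanced classes `mu` moves away from 0.5 and the pseudo-sigma grows, which widens the implied prior on the global scale.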
Examples
- Default slab: $\nu = 4$, $s = 2$ gives a Student-$t_4(0, 2^2)$ slab, so coefficients far from zero are softly capped around a few units (on standardized scale).
- Stan parameterization (App. C): use the non-centered form `beta = z .* lambda_tilde * tau` with `z ~ normal(0, 1)`, `lambda ~ student_t(nu_local, 0, 1)` (`nu_local = 1` → half-Cauchy), `tau ~ student_t(nu_global, 0, scale_global*sigma)`, `caux ~ inv_gamma(0.5*slab_df, 0.5*slab_df)`, `c = slab_scale*sqrt(caux)`, and `lambda_tilde = sqrt(c^2 * square(lambda) ./ (c^2 + tau^2 * square(lambda)))`. Set `scale_global` $= p_0 / ((D - p_0)\sqrt{n})$, so that the prior scale of `tau` is $\tau_0$. A heavier non-centered variant (`aux1`/`aux2` decomposition, Peltola et al. 2014) avoids divergences from the funnel-shaped posterior.
- rstanarm: `tau0 <- p0/(D-p0)/sqrt(n)` then `prior_coeff <- hs(df=1, global_df=1, global_scale=tau0)` and fit with `stan_glm(..., prior = prior_coeff)` (rstanarm scales by $\sigma$ automatically). For logistic, use `sigma <- 1/sqrt(mean(y)*(1-mean(y)))` as the pseudo-sigma in `tau0`, i.e. `tau0 <- p0/(D-p0) * sigma/sqrt(n)`.
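To see what the non-centered construction produces, the following sketch draws coefficients from the prior in plain Python, mirroring the Stan quantities named above with `nu_local = nu_global = 1`. It is not the Stan program itself: the function names, seed, and argument values are assumptions for illustration.

```python
import math
import random

def half_cauchy(rng, scale=1.0):
    """Draw from a half-Cauchy(0, scale) via the inverse-CDF of a Cauchy."""
    return abs(scale * math.tan(math.pi * (rng.random() - 0.5)))

def draw_prior(D, scale_global, sigma, slab_scale, slab_df, rng):
    """One joint prior draw of (beta, prior standard deviations, c)."""
    tau = half_cauchy(rng, scale_global * sigma)  # global scale
    # caux ~ inv_gamma(0.5 * slab_df, 0.5 * slab_df); gammavariate uses (shape, scale).
    caux = 1.0 / rng.gammavariate(0.5 * slab_df, 1.0 / (0.5 * slab_df))
    c = slab_scale * math.sqrt(caux)
    beta, prior_sd = [], []
    for _ in range(D):
        z = rng.gauss(0.0, 1.0)   # non-centered standard normal
        lam = half_cauchy(rng)    # local scale
        lam_tilde = math.sqrt(c**2 * lam**2 / (c**2 + tau**2 * lam**2))
        prior_sd.append(lam_tilde * tau)
        beta.append(z * lam_tilde * tau)
    return beta, prior_sd, c

rng = random.Random(1)
beta, prior_sd, c = draw_prior(D=10, scale_global=0.005, sigma=1.0,
                               slab_scale=2.0, slab_df=4, rng=rng)
print(c)
print(max(prior_sd))  # never exceeds c
```

Repeating the draw many times and inspecting `beta` gives a quick prior-predictive check of how the slab limits the largest coefficients.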
Connections
- Extends The Horseshoe Prior by adding the slab; lives in Global-Local Shrinkage Priors.
- Reuses the $\tau_0$ calibration of Choosing the Global Scale and Effective Nonzeros and shrinks its $m_{\text{eff}}$ by the factor $(1 - b)$.
- Continuous, finite-width counterpart of Spike-and-Slab Prior for Covariate Selection.
- Curbing the largest coefficients is a regularization/overfitting control, cf. Overfitting and Information Criteria.
See Also
- The Horseshoe Prior — the base prior being regularized
- Choosing the Global Scale and Effective Nonzeros — $\tau_0$ and $m_{\text{eff}}$
- Global-Local Shrinkage Priors — the scale-mixture framework
- Spike-and-Slab Prior for Covariate Selection — finite-slab discrete analogue
- Horseshoe and Regularized Horseshoe Priors — overview hub
- Overfitting and Information Criteria — regularization and model complexity