Choosing the Global Scale and Effective Nonzeros

Summary

The single most consequential hyperparameter of the horseshoe is the global scale $τ$, yet there was previously no principled way to set it. Piironen & Vehtari define the effective number of nonzero coefficients $m_{eff} = \sum_{j} (1 - κ_{j})$ and derive its prior mean as a function of $τ$. Inverting that relation turns a prior guess $p_{0}$ for the number of relevant variables into a concrete scale $τ_{0} = \frac{p_{0}}{D - p_{0}} \frac{σ}{\sqrt{n}}$. The key lesson: $τ$ must scale as $σ / \sqrt{n}$, so the popular default $τ \sim C^{+} (0, 1)$ is usually a poor choice.

Overview

For a fixed $τ$, the implied sparsity depends on the dimension $D$, the noise $σ$, and the sample size $n$, so reasoning about a single $κ_{j}$ is not enough. The right object is the aggregate effective model size. The authors prefer full Bayesian inference for $τ$ over plug-in estimates (the marginal-likelihood estimate can collapse to $\hat{τ} = 0$ for very sparse vectors; cross-validation ignores posterior uncertainty), but the prior still needs a sensible location, which $m_{eff}$ supplies. This note builds directly on the shrinkage factor of Global-Local Shrinkage Priors and feeds the slab logic of Regularized Horseshoe (Finnish Horseshoe).

Main Content

Effective number of nonzero coefficients

$m_{eff} = \sum_{j=1}^{D} (1 - κ_{j}) .$
When the $κ_{j}$ are near 0 or 1 (as for the horseshoe), $m_{eff}$ counts how many coefficients are active/unshrunk — an interpretable measure of effective model size.
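
As a small illustration, the sketch below sums $1 - κ_{j}$ over a made-up vector of shrinkage factors. The values are hypothetical, chosen only so that two of the five coefficients are nearly unshrunk.

```python
# Hypothetical shrinkage factors (not from the source):
# kappa near 0 means the coefficient is left unshrunk,
# kappa near 1 means it is shrunk almost to zero.
kappa = [0.02, 0.97, 0.99, 0.05, 0.98]

# Effective number of nonzero coefficients: m_eff = sum_j (1 - kappa_j).
m_eff = sum(1.0 - k for k in kappa)

print(f"m_eff = {m_eff:.2f} out of D = {len(kappa)}")
```

With these values $m_{eff}$ comes out close to 2, matching the two coefficients whose $κ_{j}$ is near 0.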

Prior mean and variance of $m_{eff}$

Using $E (κ_{j} ∣ τ, σ) = \frac{1}{1 + a_{j}}$ and $Var (κ_{j} ∣ τ, σ) = \frac{a_{j}}{2 (1 + a_{j})^{2}}$ with $a_{j} = τ σ^{- 1} \sqrt{n} s_{j}$:
$E (m_{eff} ∣ τ, σ) = \sum_{j=1}^{D} \frac{a_{j}}{1 + a_{j}}, \quad Var (m_{eff} ∣ τ, σ) = \sum_{j=1}^{D} \frac{a_{j}}{2 (1 + a_{j})^{2}} .$
For standardized predictors ($s_{j}^{2} = 1$) these simplify to $E (m_{eff} ∣ τ, σ) = \frac{τ σ^{- 1} \sqrt{n}}{1 + τ σ^{- 1} \sqrt{n}} D, \quad Var (m_{eff} ∣ τ, σ) = \frac{τ σ^{- 1} \sqrt{n}}{2 (1 + τ σ^{- 1} \sqrt{n})^{2}} D .$
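
The moment formulas can be evaluated directly. The following is a minimal sketch, assuming $a_{j} = τ σ^{- 1} \sqrt{n} s_{j}$; the function name `meff_prior_moments` and the illustrative setting ($D = 100$, $n = 200$, $σ = 1$) are choices made here, not part of the source.

```python
import math

def meff_prior_moments(tau, sigma, n, s):
    """Prior mean and variance of m_eff given tau and sigma.

    `s` holds the predictor scales s_j, one entry per coefficient.
    """
    mean = 0.0
    var = 0.0
    for s_j in s:
        a_j = tau / sigma * math.sqrt(n) * s_j  # a_j = tau * sigma^-1 * sqrt(n) * s_j
        mean += a_j / (1.0 + a_j)               # E(1 - kappa_j | tau, sigma)
        var += a_j / (2.0 * (1.0 + a_j) ** 2)   # Var(kappa_j | tau, sigma)
    return mean, var

# Illustrative setting (an assumption): D = 100 standardized predictors.
D, n, sigma = 100, 200, 1.0
mean_at_one, var_at_one = meff_prior_moments(1.0, sigma, n, [1.0] * D)
print(f"tau = 1: E(m_eff) = {mean_at_one:.1f}, sd = {math.sqrt(var_at_one):.1f}")
```

In this setting, fixing $τ = 1$ implies a prior expectation of roughly 93 of the 100 coefficients being unshrunk, which is hard to reconcile with a belief in sparsity.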

Prior-guess formula for the global scale

Solving $E (m_{eff} ∣ τ, σ) = p_{0}$ for standardized predictors gives the scale that places most prior mass for $m_{eff}$ near a prior guess $p_{0}$ :
$τ_{0} = \frac{p_{0}}{D - p_{0}} \frac{σ}{\sqrt{n}} .$
Either fix $τ = τ_{0}$ or, better, use it as the scale of a weakly informative half-normal or half-Cauchy hyperprior, e.g. $τ \sim C^{+} (0, τ_{0}^{2})$. Two structural facts: (i) $τ$ must scale as $σ / \sqrt{n}$ to keep beliefs about $m_{eff}$ invariant to $σ$ and $n$; (ii) $τ_{0}$ is typically far from 1 or $σ$, the scales used by the defaults $τ \sim C^{+} (0, 1)$ and $τ ∣ σ \sim C^{+} (0, σ^{2})$.
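
The inversion and the invariance in (i) can be checked numerically. This is a sketch under the standardized-predictor assumption with $a = τ σ^{- 1} \sqrt{n}$; the helper names `tau0` and `expected_meff` and the grid of $(σ, n)$ values are hypothetical.

```python
import math

def tau0(p0, D, sigma, n):
    """Global scale implied by a prior guess p0 of relevant variables."""
    if not 0 < p0 < D:
        raise ValueError("p0 must lie strictly between 0 and D")
    return p0 / (D - p0) * sigma / math.sqrt(n)

def expected_meff(tau, D, sigma, n):
    """Prior mean of m_eff for standardized predictors (s_j = 1)."""
    a = tau / sigma * math.sqrt(n)
    return a / (1.0 + a) * D

p0, D = 5, 1000

# tau0 changes with sigma and n, but the implied E(m_eff) stays at p0.
checks = []
for sigma, n in [(1.0, 200), (1.0, 2000), (3.0, 200)]:
    t = tau0(p0, D, sigma, n)
    checks.append((sigma, n, t, expected_meff(t, D, sigma, n)))
    print(f"sigma={sigma}, n={n}: tau0={t:.2e}, E(m_eff)={checks[-1][3]:.3f}")
```

Each row uses a different $τ_{0}$, yet plugging it back into the prior mean returns $p_{0}$, which is the sense in which the $σ / \sqrt{n}$ scaling keeps the prior on $m_{eff}$ fixed.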

Connection to the oracle result

For the simplified model $y_{i} = β_{i} + ε_{i}$ ($X = I$, $D = n$), van der Pas et al. (2014) prove that the minimax-optimal scale (up to a log factor) is $τ^{*} = p^{*} / n$, where $p^{*}$ is the true number of nonzeros. In this model the columns of $X$ have unit norm, so each $a_{j}$ reduces to $τ / σ$ and solving $E (m_{eff} ∣ τ, σ) = p_{0}$ gives $τ_{0} = \frac{p_{0}}{D - p_{0}} σ$. Setting $p_{0} = p^{*}$ and $σ = 1$ yields $τ_{0} = \frac{p^{*}}{D - p^{*}} \to p^{*} / D = τ^{*}$ as $n, p^{*} \to \infty$ with $p^{*} = o (n)$. So $m_{eff}$-based tuning recovers the oracle scale but is more generally applicable.

Why the default $C^{+} (0, 1)$ is dubious

Sampling $τ \sim p (τ)$ and $λ_{j} \sim C^{+} (0, 1)$, then computing $m_{eff}$, shows how each choice behaves: fixing $τ = τ_{0}$ gives a near-symmetric prior around $p_{0}$; a half-normal $N^{+} (0, τ_{0}^{2})$ skews toward $m_{eff} < p_{0}$; a half-Cauchy $C^{+} (0, τ_{0}^{2})$ adds a thick tail. But $τ \sim C^{+} (0, 1)$ places far too much mass on large $τ$, favoring solutions with most coefficients unshrunk, which is sensible only when $τ$ is strongly identified by the data. Crucially, the first three priors keep the same $m_{eff}$ prior under changes in $σ$ or $n$; $C^{+} (0, 1)$ does not.

Examples

Worked $τ_{0}$: $D = 1000$, $n = 200$, $σ = 1$, prior guess $p_{0} = 5$ relevant variables gives $τ_{0} = \frac{5}{995} \cdot \frac{1}{\sqrt{200}} \approx 3.6 \times 10^{- 4}$.
Five of a hundred: $p_{0} = 5$ of $D = 100$, $n = 200$, $σ = 1$: $τ_{0} = \frac{5}{95} \cdot \frac{1}{\sqrt{200}} \approx 3.7 \times 10^{- 3}$.
Sampling $m_{eff}$ : to inspect any $p (τ)$ , draw $τ \sim p (τ)$ and $λ_{j} \sim C^{+} (0, 1)$ , compute $κ_{j}$ from the shrinkage-factor formula, then $m_{eff} = \sum_{j} (1 - κ_{j})$ — works for any scale-mixture prior even when closed-form moments are unavailable.
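
The sampling recipe can be sketched with the standard library alone. This assumes standardized predictors and the shrinkage factor in the form $κ_{j} = 1 / (1 + a^{2} λ_{j}^{2})$ with $a = τ σ^{- 1} \sqrt{n}$, which is taken from the shrinkage-factor note rather than stated here; the helper names, the seed, and the number of draws are arbitrary choices.

```python
import math
import random

def half_cauchy(rng, scale=1.0):
    # Inverse-CDF draw: for U ~ Uniform[0, 1), scale * tan(pi * U / 2) is half-Cauchy.
    return scale * math.tan(math.pi * rng.random() / 2.0)

def sample_meff(draw_tau, D, n, sigma, rng, num_draws=2000):
    """Monte Carlo draws from the prior on m_eff implied by a prior on tau."""
    draws = []
    for _ in range(num_draws):
        tau = draw_tau(rng)
        a = tau / sigma * math.sqrt(n)  # assumes standardized predictors, s_j = 1
        m = 0.0
        for _ in range(D):
            lam = half_cauchy(rng)                # local scale lambda_j ~ C+(0, 1)
            kappa = 1.0 / (1.0 + (a * lam) ** 2)  # assumed shrinkage-factor form
            m += 1.0 - kappa
        draws.append(m)
    return draws

rng = random.Random(0)
D, n, sigma, p0 = 100, 200, 1.0, 5
tau_0 = p0 / (D - p0) * sigma / math.sqrt(n)

fixed = sample_meff(lambda r: tau_0, D, n, sigma, rng)                 # tau = tau_0
default = sample_meff(lambda r: half_cauchy(r, 1.0), D, n, sigma, rng)  # tau ~ C+(0, 1)

mean_fixed = sum(fixed) / len(fixed)
median_default = sorted(default)[len(default) // 2]
print(f"tau = tau_0: mean m_eff = {mean_fixed:.1f}")
print(f"tau ~ C+(0, 1): median m_eff = {median_default:.1f}")
```

Swapping in another `draw_tau` (for example a half-normal or a half-Cauchy with scale `tau_0`) lets the same loop compare the candidate hyperpriors side by side.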

Connections

Aggregates the shrinkage factor $κ_{j}$ of Global-Local Shrinkage Priors.
Tunes $τ$ for The Horseshoe Prior; the same $τ_{0}$ (with $p_{0}$ = guess for coefficients far from zero) carries to the Regularized Horseshoe (Finnish Horseshoe), where $\bar{m}_{eff} = (1 - b) m_{eff}$ with $b = (1 + n σ^{- 2} c^{2})^{- 1}$.
“Effective model size” is the Bayesian-shrinkage analogue of effective parameters in Overfitting and Information Criteria.
$σ / \sqrt{n}$ scaling and pooling strength echo Hierarchical Linear Models.
