James-Stein Estimator

Summary

The James-Stein (JS) estimator shrinks the vector of observed values $z$ toward a center by an empirically estimated factor: $\hat{μ}^{(JS)} = (1 - \frac{N - 2}{S}) z$ with $S = ∥ z ∥^{2}$. It is exactly the Bayes estimator $\hat{μ}^{(Bayes)} = (1 - \frac{1}{A + 1}) z$ with the unknown shrinkage term $1/(A + 1)$ replaced by its unbiased estimate $(N - 2)/S$ formed from the marginal distribution. The general “shrink toward the grand mean” form pulls each $z_{i}$ toward $\bar{z}$. This is empirical Bayes in action: the prior is estimated from the $N$ parallel cases.

Overview

Consider $N$ parallel normal estimation problems (Efron eq. 1.7):

$μ_{i} \sim N(0, A) \quad \text{and} \quad z_{i} ∣ μ_{i} \sim N(μ_{i}, 1), \quad i = 1, \dots, N,$

with total squared-error loss $L(μ, \hat{μ}) = ∥ \hat{μ} - μ ∥^{2} = \sum_{i = 1}^{N} (\hat{μ}_{i} - μ_{i})^{2}$ and risk $R(μ) = E_{μ}\{L(μ, \hat{μ})\}$.

The obvious estimator, used implicitly in every regression and ANOVA, is the MLE $\hat{μ}^{(MLE)} = z$, with constant risk $R^{(MLE)}(μ) = N$ for every $μ$. If the prior were known, Bayes rule (eq. 1.10) gives posterior $μ ∣ z \sim N_{N}(B z, B I)$ with $B = A/(A + 1)$, and the Bayes estimator (eq. 1.16) is

$\hat{μ}^{(Bayes)} = B z = (1 - \frac{1}{A + 1}) z .$

With $A = 1$ this shrinks the MLE halfway toward $0$. But if $A$ is unknown, the Bayes estimator cannot be computed; this is precisely where empirical Bayes enters (see Empirical Bayes - Overview).
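The setup above can be checked numerically. The following is a minimal simulation sketch, not part of Efron's text: the seed, the number of replications `reps`, and the variable names are illustrative assumptions. It draws from the two-level model and compares the average loss of the MLE with that of the Bayes estimator for a known $A$.

```python
import numpy as np

# Illustrative settings (assumptions, not from the source).
rng = np.random.default_rng(0)
N, A, reps = 10, 1.0, 20000

# Two-level model: mu_i ~ N(0, A), z_i | mu_i ~ N(mu_i, 1).
mu = rng.normal(0.0, np.sqrt(A), size=(reps, N))
z = mu + rng.normal(size=(reps, N))

# Bayes shrinkage factor for a known prior variance A.
B = A / (A + 1.0)

# Total squared-error loss per replication for each estimator.
loss_mle = np.sum((z - mu) ** 2, axis=1)
loss_bayes = np.sum((B * z - mu) ** 2, axis=1)

risk_mle = loss_mle.mean()      # should be close to N
risk_bayes = loss_bayes.mean()  # should be close to N * A / (A + 1)

print("MLE average loss:  ", risk_mle)
print("Bayes average loss:", risk_bayes)
```

With $A = 1$ the Bayes estimator should show roughly half the average loss of the MLE, in line with the halfway shrinkage described above.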

Main Content

Estimating the shrinkage factor from the marginal ^marginal-estimate

Integrating the prior out, the marginal distribution of $z$ (eq. 1.20) is
$z \sim N_{N} (0, (A + 1) I) .$
Hence $S = ∥ z ∥^{2} \sim (A + 1) χ_{N}^{2}$. Since $E\{1/χ_{N}^{2}\} = 1/(N - 2)$ for $N \geq 3$, this yields the key unbiased estimate of the shrinkage term:
$E\left\{\frac{N - 2}{S}\right\} = \frac{1}{A + 1} .$
The unknown Bayes quantity $1/ (A + 1)$ is thus estimable directly from the pooled data — see Robbins Formula and Poisson Empirical Bayes for the analogous nonparametric move.
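As a quick sanity check of the unbiasedness claim, one can sample $z$ from its marginal distribution and average $(N - 2)/S$. This is only a Monte Carlo sketch; the seed, `reps`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative settings (assumptions, not from the source).
rng = np.random.default_rng(1)
N, A, reps = 10, 1.0, 200000

# Marginally, z ~ N_N(0, (A + 1) I).
z = rng.normal(0.0, np.sqrt(A + 1.0), size=(reps, N))

# S = ||z||^2 for each simulated data vector.
S = np.sum(z ** 2, axis=1)

# The Monte Carlo average of (N - 2) / S should approach 1 / (A + 1).
estimate = np.mean((N - 2) / S)
target = 1.0 / (A + 1.0)

print("average of (N - 2)/S:", estimate)
print("1 / (A + 1):         ", target)
```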

James-Stein estimator (shrink toward 0) ^js-estimator

Substituting the unbiased estimate $(N - 2) / S$ for $1/ (A + 1)$ in the Bayes rule gives the James-Stein estimator (Efron eq. 1.23):
$\hat{μ}^{(JS)} = (1 - \frac{N - 2}{S}) z, \qquad S = ∥ z ∥^{2} = \sum_{i = 1}^{N} z_{i}^{2} .$
The name “empirical Bayes” is apt: the Bayes estimator (1.16) is itself empirically estimated from the data. This is only possible because $N$ similar problems $z_{i} \sim N (μ_{i}, 1)$ are under simultaneous consideration.
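A minimal sketch of the shrink-toward-0 rule (1.23) in NumPy, assuming unit sampling variance as in the model above. The function name `james_stein` and the toy input are hypothetical.

```python
import numpy as np

def james_stein(z):
    """James-Stein estimate shrinking z toward 0 (unit-variance case).

    Hypothetical helper: implements (1 - (N - 2) / S) * z with S = ||z||^2.
    """
    z = np.asarray(z, dtype=float)
    N = z.size
    if N < 3:
        raise ValueError("the James-Stein estimator requires N >= 3")
    S = np.sum(z ** 2)
    # The factor is negative whenever S < N - 2; this plain form does not truncate it.
    return (1.0 - (N - 2) / S) * z

# Toy input: N = 5 and S = 25, so the factor is 1 - 3/25 = 0.88.
z_toy = np.array([3.0, 4.0, 0.0, 0.0, 0.0])
mu_js_toy = james_stein(z_toy)
print(mu_js_toy)
```

Every coordinate is multiplied by the same data-dependent factor, which is why the estimate for one case depends on all $N$ observations.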

James-Stein estimator (shrink toward the grand mean) ^js-grand-mean

We need not shrink toward $0$. Starting from the more general model $μ_{i} \overset{ind}{\sim} N(M, A)$, $z_{i} ∣ μ_{i} \overset{ind}{\sim} N(μ_{i}, σ_{0}^{2})$ (eq. 1.32), the Bayes rule $\hat{μ}_{i}^{(Bayes)} = M + B(z_{i} - M)$ with $B = A/(A + σ_{0}^{2})$ has empirical Bayes form (eq. 1.35):
$\hat{μ}_{i}^{(JS)} = \bar{z} + (1 - \frac{(N - 3) σ_{0}^{2}}{S}) (z_{i} - \bar{z}),$
with $\bar{z} = \sum z_{i}/N$ and $S = \sum (z_{i} - \bar{z})^{2}$. Each $z_{i}$ is pulled toward the grand mean $\bar{z}$ by a data-estimated factor; the risk-dominance theorem now holds for $N \geq 4$ (one degree of freedom is spent estimating the prior mean $M$ by $\bar{z}$).
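The grand-mean form (1.35) can be sketched the same way. The function name `james_stein_grand_mean`, its argument `sigma0_sq` (standing for $σ_{0}^{2}$), and the toy input are assumptions made for illustration.

```python
import numpy as np

def james_stein_grand_mean(z, sigma0_sq):
    """James-Stein estimate shrinking each z_i toward the grand mean.

    Hypothetical helper: z_bar + (1 - (N - 3) * sigma0_sq / S) * (z - z_bar),
    with S the sum of squared deviations from z_bar.
    """
    z = np.asarray(z, dtype=float)
    N = z.size
    if N < 4:
        raise ValueError("the grand-mean form requires N >= 4")
    z_bar = z.mean()
    S = np.sum((z - z_bar) ** 2)
    factor = 1.0 - (N - 3) * sigma0_sq / S
    return z_bar + factor * (z - z_bar)

# Toy input: z_bar = 3, S = 10, N = 5, sigma0_sq = 1, so the factor is 0.8.
z_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mu_js_toy = james_stein_grand_mean(z_toy, sigma0_sq=1.0)
print(mu_js_toy)
```

The ordering of the cases is preserved and their mean stays at $\bar{z}$; only the spread around the grand mean is reduced.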

Overall Bayes risk and the modest EB penalty ^js-risk

The overall Bayes risk of $\hat{μ}^{(Bayes)}$ is $R_{A}^{(Bayes)} = N A/(A + 1)$, versus $R_{A}^{(MLE)} = N$. The James-Stein estimator pays only a small penalty for not knowing $A$ (eqs. 1.24-1.25):
$R_{A}^{(JS)} = N \frac{A}{A + 1} + \frac{2}{A + 1}, \qquad \frac{R_{A}^{(JS)}}{R_{A}^{(Bayes)}} = 1 + \frac{2}{N \cdot A} .$
For $N = 10$, $A = 1$, the ratio is $1 + 2/10 = 1.2$, so $R_{A}^{(JS)}$ is only 20% greater than the true Bayes risk; almost all the Bayesian savings are recovered without knowing the prior.
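The arithmetic behind these risk comparisons is easy to reproduce. The helper names below are hypothetical; the formulas are the ones stated above.

```python
def bayes_risk(N, A):
    """Overall Bayes risk N * A / (A + 1) of the Bayes estimator."""
    return N * A / (A + 1.0)

def js_risk(N, A):
    """Overall risk of James-Stein: the Bayes risk plus the penalty 2 / (A + 1)."""
    return bayes_risk(N, A) + 2.0 / (A + 1.0)

def js_to_bayes_ratio(N, A):
    """Ratio of James-Stein risk to Bayes risk, 1 + 2 / (N * A)."""
    return 1.0 + 2.0 / (N * A)

N, A = 10, 1.0
print(N)                        # MLE risk: 10
print(bayes_risk(N, A))         # 5.0
print(js_risk(N, A))            # 6.0
print(js_to_bayes_ratio(N, A))  # 1.2
```

For these values the MLE risk is 10, the Bayes risk is 5, and James-Stein sits at 6, so it recovers four of the five units of risk the Bayes rule saves.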

Limited-translation compromise ^limited-translation

To protect genuinely unusual cases from being over-shrunk, Efron’s limited-translation estimator $\hat{μ}_{i}^{(D)}$ (eq. 1.37) follows the JS estimate but never deviates more than $D σ_{0}$ from $z_{i}$:
$\hat{μ}_{i}^{(D)} = \begin{cases} \max(\hat{μ}_{i}^{(JS)}, \hat{μ}_{i}^{(MLE)} - D σ_{0}) & \text{for } z_{i} > \bar{z}, \\ \min(\hat{μ}_{i}^{(JS)}, \hat{μ}_{i}^{(MLE)} + D σ_{0}) & \text{for } z_{i} \leq \bar{z} . \end{cases}$
Taking $D = 1$ in the baseball data costs only ~10% of the overall JS advantage while sharply limiting damage to outliers like Clemente.
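A minimal sketch of the limited-translation rule, applied to the Clemente and Alvis rows of the baseball table in the Examples section. The function name and argument names are hypothetical; $σ_{0}$ is computed from the binomial variance formula $\bar{z}(1 - \bar{z})/45$ given there.

```python
import numpy as np

def limited_translation(z, mu_js, z_bar, D, sigma0):
    """Limited-translation estimate: follow JS, but stay within D * sigma0 of z_i.

    Hypothetical helper; z plays the role of the MLE for each case.
    """
    z = np.asarray(z, dtype=float)
    mu_js = np.asarray(mu_js, dtype=float)
    above = np.maximum(mu_js, z - D * sigma0)  # branch for z_i > z_bar
    below = np.minimum(mu_js, z + D * sigma0)  # branch for z_i <= z_bar
    return np.where(z > z_bar, above, below)

# Clemente and Alvis: early-season averages and JS estimates from the table.
z_bar = 0.265
sigma0 = np.sqrt(z_bar * (1.0 - z_bar) / 45.0)  # about 0.066
z_two = np.array([0.400, 0.156])
mu_js_two = np.array([0.294, 0.242])

mu_D = limited_translation(z_two, mu_js_two, z_bar, D=1.0, sigma0=sigma0)
print(mu_D)
```

Under these inputs Clemente's estimate moves from .294 to about .334 and Alvis's from .242 to about .222, so both extreme cases are shrunk by at most one $σ_{0}$.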

Examples

Baseball batting averages (Efron Table 1.1, $N = 18$ )

Early-1970-season batting averages $z_{i} = \hat{μ}_{i}^{(MLE)}$ (hits/45 at-bats) predict true season averages $μ_{i}$. With grand average $\bar{z} = 0.265$ and $σ_{0}^{2} = \bar{z}(1 - \bar{z})/45$ (binomial variance), the JS estimates (1.35) shrink each player toward $0.265$:

Player	hits/AB	$\hat{μ}_{i}^{(MLE)} = z_{i}$	true $μ_{i}$	$\hat{μ}_{i}^{(JS)}$
Clemente	18/45	.400	.346	.294
F. Robinson	17/45	.378	.298	.289
Munson	8/45	.178	.316	.247
Alvis	7/45	.156	.200	.242
Grand Average		.265	.265	.265

The ratio of total prediction errors is
$\frac{\sum_{i = 1}^{18} (\hat{μ}_{i}^{(JS)} - μ_{i})^{2}}{\sum_{i = 1}^{18} (\hat{μ}_{i}^{(MLE)} - μ_{i})^{2}} = 0.28,$
a roughly 3.5x accuracy gain for the empirical Bayes estimates. (The $z_{i}$ are binomial here, violating the exact normal theorem conditions, but the JS effect is quite insensitive to the model.)
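The four player rows shown in the table are enough to illustrate both the common shrinkage factor and the error-ratio calculation. The sketch below uses only those four rows, so its ratio is not the 18-player value of 0.28; the helper name `error_ratio` is hypothetical.

```python
import numpy as np

# The four player rows from the table (Clemente, F. Robinson, Munson, Alvis).
z_bar = 0.265
z = np.array([0.400, 0.378, 0.178, 0.156])        # MLE: early-season averages
mu_true = np.array([0.346, 0.298, 0.316, 0.200])  # true season averages
mu_js = np.array([0.294, 0.289, 0.247, 0.242])    # tabulated JS estimates

# Shrinkage factor implied by each row: (JS - z_bar) / (z - z_bar).
# The rows should agree up to the three-decimal rounding of the table.
implied_factor = (mu_js - z_bar) / (z - z_bar)

def error_ratio(mu_js, mu_mle, mu_true):
    """Ratio of total squared prediction errors, JS over MLE."""
    return np.sum((mu_js - mu_true) ** 2) / np.sum((mu_mle - mu_true) ** 2)

# Ratio restricted to the four displayed rows only (not all 18 players).
ratio_four_rows = error_ratio(mu_js, z, mu_true)

print("implied shrinkage factors:", implied_factor)
print("four-row error ratio:     ", ratio_four_rows)
```

All four rows imply a factor of about 0.21, meaning each player keeps only about a fifth of his early-season deviation from the grand average.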

Regression-based shrinkage (kidney data)

Combining covariate information with shrinkage (eq. 1.39), JS shrinks toward a fitted regression line $\hat{μ}_{i}^{(reg)} = \hat{M}_{0} + \hat{M}_{1} \cdot age_{i}$ rather than toward $\bar{z}$:
$\hat{μ}_{i}^{(JS)} = \hat{μ}_{i}^{(reg)} + (1 - \frac{(N - 4) σ_{0}^{2}}{S}) (z_{i} - \hat{μ}_{i}^{(reg)}), \qquad S = \sum (z_{i} - \hat{μ}_{i}^{(reg)})^{2} .$
See Empirical Bayes Interpretation of Shrinkage.
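A minimal sketch of the regression-based form (1.39), assuming the line is fitted by ordinary least squares with `np.polyfit`. The function name, the argument names, and the toy data are hypothetical; the kidney data themselves are not reproduced here.

```python
import numpy as np

def james_stein_regression(z, age, sigma0_sq):
    """James-Stein estimate shrinking each z_i toward a fitted regression line.

    Hypothetical helper: the line M0 + M1 * age is fitted by least squares
    (an assumption), then residuals are shrunk by 1 - (N - 4) * sigma0_sq / S.
    """
    z = np.asarray(z, dtype=float)
    age = np.asarray(age, dtype=float)
    N = z.size
    # np.polyfit returns coefficients with the highest power first.
    slope, intercept = np.polyfit(age, z, 1)
    mu_reg = intercept + slope * age
    S = np.sum((z - mu_reg) ** 2)
    factor = 1.0 - (N - 4) * sigma0_sq / S
    return mu_reg + factor * (z - mu_reg)

# Toy data: line 2 + 0.5 * age plus residuals (1, -2, 1, 0, 0, 0), which are
# orthogonal to the intercept and slope columns, so the fit recovers the line.
age_toy = np.arange(6.0)
z_toy = np.array([3.0, 0.5, 4.0, 3.5, 4.0, 4.5])

# Here S = 6 and N - 4 = 2, so sigma0_sq = 1.5 gives a factor of 0.5.
mu_js_toy = james_stein_regression(z_toy, age_toy, sigma0_sq=1.5)
print(mu_js_toy)
```

Two degrees of freedom go to the fitted intercept and slope, which is why the multiplier uses $N - 4$ in place of the $N - 3$ of the grand-mean form.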

Connections

Empirical Bayes - Overview: places JS in the broader EB program.
Robbins Formula and Poisson Empirical Bayes: the nonparametric sibling; both estimate the prior from the marginal.
Stein’s Paradox and Risk Dominance: the proof that JS dominates the MLE for $N \geq 3$.
Empirical Bayes Interpretation of Shrinkage: why "$(N - 2)/S$" is an estimated prior; link to hierarchical Bayes.
Partial Pooling as Multiple Comparisons Correction: shrinkage toward $\bar{z}$ as partial pooling.
Hierarchical Models: the fully Bayesian generalization.
