Multiple Testing Corrections
Summary
When many statistical tests are performed simultaneously, the probability of at least one false positive rises sharply. Multiple testing corrections adjust significance thresholds to control error rates. The two main frameworks are the family-wise error rate (FWER; e.g. Bonferroni), which bounds the probability of making any false positive, and the false discovery rate (FDR; e.g. Benjamini-Hochberg), which bounds the expected proportion of false discoveries among the rejected hypotheses.
The Problem
With $m$ independent tests at $\alpha = 0.05$:
| Tests ($m$) | P(at least one FP) |
|---|---|
| 1 | 5.0% |
| 10 | 40.1% |
| 100 | 99.4% |
| 1000 | ~100% |
This is why the Garden of Forking Paths is so dangerous — even without explicit testing, the implicit multiplicity inflates false positives.
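The table values follow from the complement rule: if all null hypotheses are true and the tests are independent, the chance of no false positive in $m$ tests is $(1 - \alpha)^m$. A minimal sketch that reproduces the table (the function name `prob_at_least_one_fp` is made up for illustration):

```python
def prob_at_least_one_fp(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


# Reproduce the rows of the table for alpha = 0.05.
for m in (1, 10, 100, 1000):
    print(f"m={m:>4}: {prob_at_least_one_fp(m):.1%}")
```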
Family-Wise Error Rate (FWER)
Bonferroni Correction
The simplest approach: reject $H_i$ only if $p_i \le \alpha / m$.
- Controls: probability that any false positive occurs
- Pro: simple, conservative, valid under any dependency structure
- Con: very conservative — often yields no significant results in large-scale studies
Warning
Bonferroni becomes extremely conservative as $m$ grows. With 10,000 tests at $\alpha = 0.05$, the per-test threshold drops to $5 \times 10^{-6}$, potentially missing many real effects.
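A minimal sketch of the Bonferroni rule in plain Python. The function name and the p-values are invented for illustration, not taken from any real study:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if the test is rejected
    at the Bonferroni-corrected threshold alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Illustrative (made-up) p-values; the threshold is 0.05 / 5 = 0.01.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20]
print(bonferroni_reject(p_values))
# → [True, True, False, False, False]
```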
Holm’s Step-Down Procedure
A less conservative FWER method:
- Sort p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$
- Reject $H_{(i)}$ if $p_{(i)} \le \alpha / (m - i + 1)$
- Stop at first non-rejection
Uniformly more powerful than Bonferroni while still controlling FWER.
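The step-down logic can be sketched as follows. This is a plain-Python illustration with a hypothetical function name and made-up p-values. Note that the loop uses a zero-based rank, so the threshold `alpha / (m - rank)` corresponds to $\alpha / (m - i + 1)$ for the one-based index $i$:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure. Returns one boolean per test,
    in the original (unsorted) order of p_values."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank = 0 for the smallest p-value
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


# Illustrative (made-up) p-values. Bonferroni at 0.05 / 5 = 0.01 would
# reject only the first two; Holm also rejects 0.012 (0.012 <= 0.05 / 3).
p_values = [0.001, 0.008, 0.012, 0.041, 0.20]
print(holm_reject(p_values))
# → [True, True, True, False, False]
```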
False Discovery Rate (FDR)
Concept
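Instead of preventing all false positives, FDR controls the expected proportion of false discoveries among rejected hypotheses:

$$\mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R, 1)}\right]$$

where $V$ is the number of false rejections and $R$ is the total number of rejections.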
Benjamini-Hochberg (BH) Procedure
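- Sort p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \le \frac{k}{m} q$, where $q$ is the target FDR level
- Reject all $H_{(i)}$ for $i \le k$
- Controls FDR at level $q$ under independence (or positive dependence)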
- Much more powerful than Bonferroni for large-scale testing
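A minimal sketch of the BH step-up rule in plain Python. The function name, the parameter name `q` for the target FDR level, and the p-values are all assumptions made for illustration. Unlike a step-down method, BH looks for the largest rank that passes its threshold, so a p-value can be rejected even if smaller-ranked ones missed their own thresholds:

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure. Returns one boolean per
    test, in the original (unsorted) order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest one-based rank whose p-value passes its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    reject = [False] * m
    for idx in order[:k]:  # reject everything up to and including rank k
        reject[idx] = True
    return reject


# Illustrative (made-up) p-values. Ranks 3 and 4 miss their thresholds
# (0.03 and 0.04), but rank 5 passes (0.042 <= 0.05), so all are rejected.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042]
print(benjamini_hochberg_reject(p_values))
# → [True, True, True, True, True]
```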
Q-Values (Storey)
The q-value of a test is the minimum FDR at which that test would be called significant. It is the FDR analogue of the p-value: a p-value is calibrated against the per-test false positive rate, a q-value against the false discovery rate.
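As a rough sketch, the BH-adjusted p-value of each test gives the smallest FDR level at which BH would reject it. This is a simplification: Storey's q-value additionally scales by an estimate of the proportion of true nulls, which is assumed to be 1 here (the conservative choice). The function name and p-values are made up for illustration:

```python
def bh_adjusted_p_values(p_values):
    """BH-adjusted p-values, in the original order of p_values.
    ASSUMPTION: the proportion of true nulls is taken to be 1, so this
    is a conservative stand-in for Storey's q-value, not the full method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative (made-up) p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042]
print([round(a, 4) for a in bh_adjusted_p_values(p_values)])
# → [0.005, 0.02, 0.042, 0.042, 0.042]
```

Rejecting every test whose adjusted value is at most the target FDR level gives the same decisions as running the BH procedure at that level.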
When to Use Each
| Method | Best for | Error controlled |
|---|---|---|
| Bonferroni | Few tests, each individually important | FWER (any FP) |
| Holm | Few tests, want more power than Bonferroni | FWER |
| BH/FDR | Many tests, batch follow-up (genomics, imaging) | FDR (expected proportion of FPs among rejections) |
| Q-values | Ranking results by reliability | FDR |
Connection to Bayesian Approaches
Tip
Hierarchical Models provide a natural Bayesian alternative to multiple testing corrections. Partial pooling shrinks estimates toward the grand mean, automatically regularizing extreme results — achieving a similar effect to FDR control but derived from the model structure rather than an ad hoc correction. See Forking Paths and Bayesian Approaches.
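As a rough illustration of the shrinkage described above, consider the simplest normal-normal hierarchical model, with the grand mean $\mu$, the sampling variance $\sigma^2$, and the between-group variance $\tau^2$ all treated as known. These symbols are assumptions chosen for this sketch, not notation taken from the linked notes:

$$
y_j \mid \theta_j \sim \mathcal{N}(\theta_j, \sigma^2), \qquad \theta_j \sim \mathcal{N}(\mu, \tau^2)
$$

$$
\mathbb{E}[\theta_j \mid y_j] = \mu + \frac{\tau^2}{\tau^2 + \sigma^2}\,(y_j - \mu)
$$

The factor $\tau^2 / (\tau^2 + \sigma^2)$ lies between 0 and 1, so each estimate is pulled toward $\mu$, and the pull is stronger when the noise $\sigma^2$ is large relative to the true between-group spread $\tau^2$.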
See Also
- Garden of Forking Paths — why multiple comparisons are a problem even without explicit testing
- Researcher Degrees of Freedom — sources of implicit multiplicity
- Hierarchical Models — the Bayesian structural alternative
- Multiple Comparisons - Bayesian Perspective — Gelman et al. (2009) argue multilevel models replace classical corrections entirely
- Type S and Type M Errors — reframing statistical error beyond Type 1/Type 2
- Partial Pooling as Multiple Comparisons Correction — formal algebra of how shrinkage reduces z-scores
- Power Analysis and Sample Size — designing studies with adequate power
- The Experimental Ideal — even randomized experiments require multiple comparisons correction when testing many outcomes