How should I handle multiple comparisons when selecting from hundreds of models?

Summary

Running hundreds of models and selecting variables or transformations based on statistical significance is a textbook instance of the Garden of Forking Paths problem — the p-values from selected models are invalid because the selection procedure is data-contingent. The vault contains a progression of solutions: classical corrections (FDR) are a starting point but become impractical at scale. The recommended approach is to use multilevel models with partial pooling, regularizing priors, and projection predictive variable selection rather than significance-based model selection.

Answer

The Problem: Why Significance-Based Selection Fails

Running hundreds of models and picking variables based on $p < 0.05$ creates a massive multiplicity problem. With $m$ independent tests at $α = 0.05$ (Multiple Testing Corrections):

P (at least one false positive) = 1 - (1 - α)^{m}

With 100 models this probability is 99.4%; with 1000 it is essentially 100%.

But the problem is deeper than the raw count of tests. Gelman & Loken (2013) identify three levels of data-contingent analysis:

Pre-specified test: $T (y; ϕ)$ with $ϕ$ fixed in advance
Researcher degrees of freedom: $T (y; ϕ (y))$ — a single test is run, but different data would have led to a different test
Explicit search: $T (y; ϕ^{best} (y))$ — trying many tests and reporting the best

Selecting variable transformations and model specifications based on significance is structurally identical to case 3, even without intent to deceive (Researcher Degrees of Freedom). The combinatorial explosion is enormous: choosing which variables to include, how to transform them, which interactions to test, and which model form to use creates a space far larger than the number of models explicitly fit.

The Winner's Curse

The model or transformation you select as “most significant” is likely the one with the largest Type M error. When the true effect is near zero, an estimator with a large standard deviation is more likely to produce large point estimates — statistically significant results from a heavily searched model space are likely exaggerated, not evidence of large effects (^thm-type-m-inflation).

Approach 1: Classical Corrections (Limited Utility)

Multiple Testing Corrections adjust significance thresholds:

Method	Threshold for $m = 100$	Controls
Bonferroni	$p \leq 0.0005$	FWER (any false positive)
Holm (step-down)	Adaptive, more powerful	FWER
Benjamini-Hochberg	Adaptive, $p_{(k)} \leq (k / m) α$	FDR (proportion of false positives)

Limitations for model selection:

Bonferroni becomes extremely conservative — with hundreds of models, the threshold drops to levels where almost nothing is significant (Multiple Testing Corrections)
These methods adjust thresholds but leave point estimates unchanged — they don’t address Type M errors (exaggerated effect sizes)
They assume you’re testing pre-specified hypotheses, not searching through a model space
Classical corrections can actually increase Type S and Type M error rates by further reducing power (Type S and Type M Errors)

When Classical Corrections Are Appropriate

Bonferroni/Holm: few tests, each individually important, zero tolerance for false positives. BH/FDR: many parallel tests (genomics, imaging) with batch follow-up planned. Neither is designed for the model search problem (Multiple Testing Corrections).

Approach 2: Reframe What Errors Matter

Rather than asking “is this significant?”, ask (Type S and Type M Errors):

Type S (sign) error: Do I have the direction of the effect right?
Type M (magnitude) error: How exaggerated is my estimate likely to be? The exaggeration ratio is $∣ \overset{τ}{^} ∣/∣ τ ∣$ .

Example: Eight Schools Simulation ( ^ex-eight-schools-sim)

In 1000 simulations of 8 school effects with 28 pairwise comparisons each:

Classical analysis finds at least one significant comparison in 47% of simulations

Of those significant results, only 63% have the correct sign — a 37% Type S error rate

Bayesian hierarchical analysis: significant in only 5% of simulations, with 89% correct sign

Approach 3: Multilevel Models with Partial Pooling (Recommended)

Gelman, Hill & Yajima (2009) argue that multilevel models replace classical corrections entirely. The mechanism is partial pooling:

For a normal-normal hierarchical model, the posterior z-score for any comparison is (^thm-zscore-shrinkage):

z_{Bayes} = z_{classical} \cdot \frac{1}{1 + σ _{\overset{y}{ˉ}}^{2} / σ _{θ}^{2}}

The shrinkage factor is always $< 1$ , and it adapts to the data:

When groups are similar ( $σ_{θ}^{2}$ small): strong shrinkage, aggressive correction
When groups are genuinely different ( $σ_{θ}^{2}$ large): weak shrinkage, little correction needed

Property	Classical (Bonferroni/FDR)	Multilevel (Partial Pooling)
Adjusts	Threshold / interval width	Point estimates + intervals
Point estimates	Unchanged	Improved (shrunk toward mean)
Interval width	Always wider	Can be narrower than classical
Adapts to data	No (fixed penalty for $m$ )	Yes (adapts to variance ratio)
Type S error	Not addressed	Reduced by shrinkage
Type M error	Not addressed	Reduced by shrinkage

Approach 4: Regularizing Priors for Variable Selection

Instead of selecting variables by significance, use priors that encode skepticism about extreme effects (Overfitting and Information Criteria, Iterative Model Improvement):

Ridge-like: $β_{j} \sim N (0, τ^{2})$ — shrinks all coefficients toward zero
Lasso-like: $β_{j} \sim Laplace (0, λ)$ — encourages sparsity
Horseshoe prior: heavy-tailed, allows genuinely large signals while strongly shrinking noise — state of the art for sparse high-dimensional problems

Information Budget ( Iterative Model Improvement)

As models grow more complex, priors need to become tighter to avoid destabilizing estimates. Selecting variable transformations based on significance is effectively choosing a very permissive implicit prior over the model space.

Approach 5: Predictive Model Comparison (Don’t Pick One Model)

Rather than selecting the “best” model by significance, evaluate models on out-of-sample predictive accuracy (Model Comparison, Overfitting and Information Criteria):

PSIS-LOO (Pareto-smoothed importance sampling leave-one-out): efficient approximation of LOO-CV with $\hat{k}$ diagnostics — preferred in practice
WAIC: fully Bayesian information criterion, no Gaussian approximation needed
Stacking: when model comparison is uncertain, combine models rather than selecting one (Iterative Model Improvement):

p_{stack} (\tilde{y} ∣ data) = k \sum w_{k} p_{k} (\tilde{y} ∣ data)

Weights $w_{k}$ are chosen to minimize cross-validation error. This outperforms both single-model selection and traditional Bayesian model averaging.

Approach 6: Projection Predictive Variable Selection (Best Practice)

For your specific workflow of selecting from a large model space, projection predictive variable selection (Piironen & Vehtari, 2017) is the current best practice (Iterative Model Improvement):

Fit a full model with all candidate variables and regularizing priors (e.g., horseshoe)
Project the full model’s predictive distribution onto smaller submodels
Find the minimal submodel that preserves the full model’s predictive performance

This avoids the overfitting that comes from searching through many models independently — the variable selection is guided by the full model’s posterior, not by individual significance tests.

Approach 7: Multiverse Analysis

If you must explore many model specifications, Iterative Model Improvement recommends multiverse analysis: fit all plausible variants and examine how conclusions change. If conclusions are robust across the model space, the specific model choice matters less.

Practical Implications

For your workflow of running hundreds of models:

Stop selecting by significance — p-values from a searched model space are meaningless (Forking Paths and Bayesian Approaches)
Fit one well-specified model with regularizing priors (horseshoe for sparsity) that includes all candidate variables and transformations
Use projection predictive selection to identify the minimal sufficient variable set
Evaluate via PSIS-LOO or WAIC, not in-sample $R^{2}$ or $p$ -values
If comparing groups/conditions, use multilevel models — partial pooling handles multiple comparisons automatically
Report full posterior intervals, not binary significance decisions
Be suspicious of large effects from the model search — they are likely Type M errors

Source Notes

Note	Relevance
Multiple Testing Corrections	Classical corrections: Bonferroni, Holm, BH/FDR
Garden of Forking Paths	Why model search invalidates p-values (Gelman & Loken 2013)
Researcher Degrees of Freedom	Catalogs sources of analytic flexibility
Type S and Type M Errors	Reframes errors: sign and magnitude matter more than Type 1
Multiple Comparisons - Bayesian Perspective	Core paper: multilevel models replace corrections (Gelman et al. 2009)
Partial Pooling as Multiple Comparisons Correction	Formal algebra: shrinkage reduces z-scores adaptively
Forking Paths and Bayesian Approaches	Why Bayesian methods handle data-contingent analysis
Iterative Model Improvement	Stacking, projection predictive selection, multiverse analysis
Overfitting and Information Criteria	Information criteria, regularizing priors
Model Comparison	LOO-CV, PSIS-LOO, WAIC, Bayes factors
multiple2f.pdf	Gelman, Hill & Yajima (2009) — primary source
p_hacking.pdf	Gelman & Loken (2013) — garden of forking paths
StatRethink-Bayes.pdf	Statistical Rethinking Ch. 9 — overfitting and regularization
BDA3.pdf	BDA3 Ch. 7 — model comparison
BayesWorkflow.pdf	Gelman et al. (2020) — Bayesian workflow

Hierarchical Models — the foundational framework for partial pooling (BDA3 Ch. 5)
Hierarchical Linear Models — regression extensions with partial pooling
Bayesian Linear Regression — regularizing priors in regression
Power Analysis and Sample Size — underpowered studies amplify all these problems
Choosing and Building Models — principled model building workflow
Evaluating Fitted Models — posterior predictive checks
Q - Differences Between Frequentist and Bayesian Statistics — the paradigm-level context for why these approaches differ

Gaps

No vault coverage of the horseshoe prior in detail — Bayesian Linear Regression mentions it but consider ingesting Piironen & Vehtari (2017) “Sparsity information and regularization in the horseshoe and other shrinkage priors” for the full treatment
Projection predictive variable selection lacks a dedicated note — only mentioned in Iterative Model Improvement. Consider ingesting Piironen & Vehtari (2017) “Comparison of Bayesian predictive methods for model selection” as a full source
No worked example of multiverse analysis in the vault

Follow-Up Questions

How do I implement projection predictive variable selection in practice (e.g., in R with projpred or in Python)?
What horseshoe prior specification should I use for my specific number of candidate variables?
How does partial pooling interact with cross-validation for model selection?
When is stacking preferred over projection predictive selection?
How should I handle multiple comparisons when the models are non-nested or structurally different?

Second Brain

Explorer

Q: How should I handle multiple comparisons when selecting from hundreds of models?

How should I handle multiple comparisons when selecting from hundreds of models?

Answer

The Problem: Why Significance-Based Selection Fails

Approach 1: Classical Corrections (Limited Utility)

Approach 2: Reframe What Errors Matter

Approach 3: Multilevel Models with Partial Pooling (Recommended)

Approach 4: Regularizing Priors for Variable Selection

Approach 5: Predictive Model Comparison (Don’t Pick One Model)

Approach 6: Projection Predictive Variable Selection (Best Practice)

Approach 7: Multiverse Analysis

Practical Implications

Source Notes

Gaps

Follow-Up Questions

Graph View

Table of Contents

Backlinks

Second Brain

Explorer

Q: How should I handle multiple comparisons when selecting from hundreds of models?

How should I handle multiple comparisons when selecting from hundreds of models?

Answer

The Problem: Why Significance-Based Selection Fails

Approach 1: Classical Corrections (Limited Utility)

Approach 2: Reframe What Errors Matter

Approach 3: Multilevel Models with Partial Pooling (Recommended)

Approach 4: Regularizing Priors for Variable Selection

Approach 5: Predictive Model Comparison (Don’t Pick One Model)

Approach 6: Projection Predictive Variable Selection (Best Practice)

Approach 7: Multiverse Analysis

Practical Implications

Source Notes

Related Concepts

Gaps

Follow-Up Questions

Graph View

Table of Contents

Backlinks