How should I handle multiple comparisons when selecting from hundreds of models?
Summary
Running hundreds of models and selecting variables or transformations based on statistical significance is a textbook instance of the Garden of Forking Paths problem — the p-values from selected models are invalid because the selection procedure is data-contingent. The vault contains a progression of solutions: classical corrections (FDR) are a starting point but become impractical at scale. The recommended approach is to use multilevel models with partial pooling, regularizing priors, and projection predictive variable selection rather than significance-based model selection.
Answer
The Problem: Why Significance-Based Selection Fails
Running hundreds of models and picking variables based on creates a massive multiplicity problem. With independent tests at (Multiple Testing Corrections):
With 100 models this probability is 99.4%; with 1000 it is essentially 100%.
But the problem is deeper than the raw count of tests. Gelman & Loken (2013) identify three levels of data-contingent analysis:
- Pre-specified test: with fixed in advance
- Researcher degrees of freedom: — a single test is run, but different data would have led to a different test
- Explicit search: — trying many tests and reporting the best
Selecting variable transformations and model specifications based on significance is structurally identical to case 3, even without intent to deceive (Researcher Degrees of Freedom). The combinatorial explosion is enormous: choosing which variables to include, how to transform them, which interactions to test, and which model form to use creates a space far larger than the number of models explicitly fit.
The Winner's Curse
The model or transformation you select as “most significant” is likely the one with the largest Type M error. When the true effect is near zero, an estimator with a large standard deviation is more likely to produce large point estimates — statistically significant results from a heavily searched model space are likely exaggerated, not evidence of large effects (^thm-type-m-inflation).
Approach 1: Classical Corrections (Limited Utility)
Multiple Testing Corrections adjust significance thresholds:
| Method | Threshold for | Controls |
|---|---|---|
| Bonferroni | FWER (any false positive) | |
| Holm (step-down) | Adaptive, more powerful | FWER |
| Benjamini-Hochberg | Adaptive, | FDR (proportion of false positives) |
Limitations for model selection:
- Bonferroni becomes extremely conservative — with hundreds of models, the threshold drops to levels where almost nothing is significant (Multiple Testing Corrections)
- These methods adjust thresholds but leave point estimates unchanged — they don’t address Type M errors (exaggerated effect sizes)
- They assume you’re testing pre-specified hypotheses, not searching through a model space
- Classical corrections can actually increase Type S and Type M error rates by further reducing power (Type S and Type M Errors)
When Classical Corrections Are Appropriate
Bonferroni/Holm: few tests, each individually important, zero tolerance for false positives. BH/FDR: many parallel tests (genomics, imaging) with batch follow-up planned. Neither is designed for the model search problem (Multiple Testing Corrections).
Approach 2: Reframe What Errors Matter
Rather than asking “is this significant?”, ask (Type S and Type M Errors):
- Type S (sign) error: Do I have the direction of the effect right?
- Type M (magnitude) error: How exaggerated is my estimate likely to be? The exaggeration ratio is .
Example: Eight Schools Simulation ( ^ex-eight-schools-sim)
In 1000 simulations of 8 school effects with 28 pairwise comparisons each:
- Classical analysis finds at least one significant comparison in 47% of simulations
- Of those significant results, only 63% have the correct sign — a 37% Type S error rate
- Bayesian hierarchical analysis: significant in only 5% of simulations, with 89% correct sign
Approach 3: Multilevel Models with Partial Pooling (Recommended)
Gelman, Hill & Yajima (2009) argue that multilevel models replace classical corrections entirely. The mechanism is partial pooling:
For a normal-normal hierarchical model, the posterior z-score for any comparison is (^thm-zscore-shrinkage):
The shrinkage factor is always , and it adapts to the data:
- When groups are similar ( small): strong shrinkage, aggressive correction
- When groups are genuinely different ( large): weak shrinkage, little correction needed
| Property | Classical (Bonferroni/FDR) | Multilevel (Partial Pooling) |
|---|---|---|
| Adjusts | Threshold / interval width | Point estimates + intervals |
| Point estimates | Unchanged | Improved (shrunk toward mean) |
| Interval width | Always wider | Can be narrower than classical |
| Adapts to data | No (fixed penalty for ) | Yes (adapts to variance ratio) |
| Type S error | Not addressed | Reduced by shrinkage |
| Type M error | Not addressed | Reduced by shrinkage |
Approach 4: Regularizing Priors for Variable Selection
Instead of selecting variables by significance, use priors that encode skepticism about extreme effects (Overfitting and Information Criteria, Iterative Model Improvement):
- Ridge-like: — shrinks all coefficients toward zero
- Lasso-like: — encourages sparsity
- Horseshoe prior: heavy-tailed, allows genuinely large signals while strongly shrinking noise — state of the art for sparse high-dimensional problems
Information Budget ( Iterative Model Improvement)
As models grow more complex, priors need to become tighter to avoid destabilizing estimates. Selecting variable transformations based on significance is effectively choosing a very permissive implicit prior over the model space.
Approach 5: Predictive Model Comparison (Don’t Pick One Model)
Rather than selecting the “best” model by significance, evaluate models on out-of-sample predictive accuracy (Model Comparison, Overfitting and Information Criteria):
- PSIS-LOO (Pareto-smoothed importance sampling leave-one-out): efficient approximation of LOO-CV with diagnostics — preferred in practice
- WAIC: fully Bayesian information criterion, no Gaussian approximation needed
- Stacking: when model comparison is uncertain, combine models rather than selecting one (Iterative Model Improvement):
Weights are chosen to minimize cross-validation error. This outperforms both single-model selection and traditional Bayesian model averaging.
Approach 6: Projection Predictive Variable Selection (Best Practice)
For your specific workflow of selecting from a large model space, projection predictive variable selection (Piironen & Vehtari, 2017) is the current best practice (Iterative Model Improvement):
- Fit a full model with all candidate variables and regularizing priors (e.g., horseshoe)
- Project the full model’s predictive distribution onto smaller submodels
- Find the minimal submodel that preserves the full model’s predictive performance
This avoids the overfitting that comes from searching through many models independently — the variable selection is guided by the full model’s posterior, not by individual significance tests.
Approach 7: Multiverse Analysis
If you must explore many model specifications, Iterative Model Improvement recommends multiverse analysis: fit all plausible variants and examine how conclusions change. If conclusions are robust across the model space, the specific model choice matters less.
Practical Implications
For your workflow of running hundreds of models:
- Stop selecting by significance — p-values from a searched model space are meaningless (Forking Paths and Bayesian Approaches)
- Fit one well-specified model with regularizing priors (horseshoe for sparsity) that includes all candidate variables and transformations
- Use projection predictive selection to identify the minimal sufficient variable set
- Evaluate via PSIS-LOO or WAIC, not in-sample or -values
- If comparing groups/conditions, use multilevel models — partial pooling handles multiple comparisons automatically
- Report full posterior intervals, not binary significance decisions
- Be suspicious of large effects from the model search — they are likely Type M errors
Source Notes
| Note | Relevance |
|---|---|
| Multiple Testing Corrections | Classical corrections: Bonferroni, Holm, BH/FDR |
| Garden of Forking Paths | Why model search invalidates p-values (Gelman & Loken 2013) |
| Researcher Degrees of Freedom | Catalogs sources of analytic flexibility |
| Type S and Type M Errors | Reframes errors: sign and magnitude matter more than Type 1 |
| Multiple Comparisons - Bayesian Perspective | Core paper: multilevel models replace corrections (Gelman et al. 2009) |
| Partial Pooling as Multiple Comparisons Correction | Formal algebra: shrinkage reduces z-scores adaptively |
| Forking Paths and Bayesian Approaches | Why Bayesian methods handle data-contingent analysis |
| Iterative Model Improvement | Stacking, projection predictive selection, multiverse analysis |
| Overfitting and Information Criteria | Information criteria, regularizing priors |
| Model Comparison | LOO-CV, PSIS-LOO, WAIC, Bayes factors |
| multiple2f.pdf | Gelman, Hill & Yajima (2009) — primary source |
| p_hacking.pdf | Gelman & Loken (2013) — garden of forking paths |
| StatRethink-Bayes.pdf | Statistical Rethinking Ch. 9 — overfitting and regularization |
| BDA3.pdf | BDA3 Ch. 7 — model comparison |
| BayesWorkflow.pdf | Gelman et al. (2020) — Bayesian workflow |
Related Concepts
- Hierarchical Models — the foundational framework for partial pooling (BDA3 Ch. 5)
- Hierarchical Linear Models — regression extensions with partial pooling
- Bayesian Linear Regression — regularizing priors in regression
- Power Analysis and Sample Size — underpowered studies amplify all these problems
- Choosing and Building Models — principled model building workflow
- Evaluating Fitted Models — posterior predictive checks
- Q - Differences Between Frequentist and Bayesian Statistics — the paradigm-level context for why these approaches differ
Gaps
- No vault coverage of the horseshoe prior in detail — Bayesian Linear Regression mentions it but consider ingesting Piironen & Vehtari (2017) “Sparsity information and regularization in the horseshoe and other shrinkage priors” for the full treatment
- Projection predictive variable selection lacks a dedicated note — only mentioned in Iterative Model Improvement. Consider ingesting Piironen & Vehtari (2017) “Comparison of Bayesian predictive methods for model selection” as a full source
- No worked example of multiverse analysis in the vault
Follow-Up Questions
- How do I implement projection predictive variable selection in practice (e.g., in R with
projpredor in Python)? - What horseshoe prior specification should I use for my specific number of candidate variables?
- How does partial pooling interact with cross-validation for model selection?
- When is stacking preferred over projection predictive selection?
- How should I handle multiple comparisons when the models are non-nested or structurally different?