What are some common pitfalls in statistical modeling a data scientist should be aware of?

Summary

The vault identifies eight major pitfall categories: (1) confusing correlation with causation via confounds and selection bias, (2) the garden of forking paths and multiple comparisons, (3) overfitting vs. underfitting, (4) ignoring missing data mechanisms, (5) misusing statistical tests as “golems,” (6) neglecting model checking, (7) computational problems masking model problems, and (8) Type S/M errors from underpowered analyses. Each pitfall has specific remedies grounded in Bayesian workflow, causal DAGs, and proper experimental design.

Answer

1. Confusing Correlation with Causation

The most fundamental pitfall. The vault covers this from two complementary perspectives:

Confounding and spurious association (Spurious Association and Confounds, Statistical Rethinking Ch. 5): When a confound $Z$ causes both predictor $X$ and outcome $Y$, a bivariate regression of $Y$ on $X$ shows a “significant” effect that disappears when $Z$ is included. The classic example: Waffle House density correlates with divorce rate, but both are driven by being a Southern state.
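A minimal simulation sketch of this pattern, using only the Python standard library. The variable names, sample size, and unit effect sizes are illustrative assumptions, not taken from the source notes. The adjusted coefficient is obtained by residualizing both variables on the confound (the Frisch-Waugh result), which gives the same coefficient as including the confound in the regression.

```python
import random

def slope(x, y):
    """OLS slope of y on x (intercept included)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after regressing it on x (intercept included)."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]    # confound (assumed standard normal)
x = [zi + rng.gauss(0, 1) for zi in z]     # predictor: caused by z only
y = [zi + rng.gauss(0, 1) for zi in z]     # outcome: caused by z only, no x -> y effect

naive = slope(x, y)                         # bivariate regression y ~ x
# Coefficient on x in y ~ x + z, via residualizing both x and y on z.
adjusted = slope(residuals(z, x), residuals(z, y))

print(f"naive slope:    {naive:.3f}")
print(f"adjusted slope: {adjusted:.3f}")
```

Under these assumptions the naive slope should sit near 0.5 even though the predictor has no effect on the outcome, while the adjusted slope should sit near zero.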

Post-Treatment Bias (Spurious Association and Confounds)

Controlling for variables caused by the treatment blocks the indirect causal path. If treatment → mediator → outcome, including the mediator makes the treatment appear ineffective. This connects to the econometric concept of bad controls.

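Selection bias (The Selection Problem, MHE Ch. 2): Individuals who receive treatment differ systematically from those who don’t. The observed difference decomposes as:

$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_{1i} - Y_{0i} \mid D_i = 1]}_{\text{average treatment effect on the treated}} + \underbrace{E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]}_{\text{selection bias}}$$

Selection bias can be so large it reverses the sign of the true effect: hospitals appear harmful because sick people seek them out.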
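The sign reversal can be reproduced in a small simulation. This is a sketch under invented assumptions (a constant treatment benefit of 0.5, normally distributed baseline health, and a noisy threshold rule for who seeks treatment). None of these numbers come from the source. Because both potential outcomes are simulated, the naive comparison can be split exactly into the treatment effect on the treated plus selection bias.

```python
import random

rng = random.Random(1)
n = 20000
true_effect = 0.5                                   # assumed benefit of treatment for everyone

y0 = [rng.gauss(0, 1) for _ in range(n)]            # health without treatment
y1 = [v + true_effect for v in y0]                  # health with treatment
# Sicker people (low y0) are more likely to seek treatment.
d = [1 if v + rng.gauss(0, 0.5) < -0.5 else 0 for v in y0]

def mean(values):
    return sum(values) / len(values)

treated = [i for i in range(n) if d[i] == 1]
control = [i for i in range(n) if d[i] == 0]

# What an observational comparison sees: treated outcomes minus untreated outcomes.
naive = mean([y1[i] for i in treated]) - mean([y0[i] for i in control])
# The two components, computable here only because both potential outcomes are simulated.
att = mean([y1[i] - y0[i] for i in treated])
selection_bias = mean([y0[i] for i in treated]) - mean([y0[i] for i in control])

print(f"naive difference: {naive:.3f}")
print(f"ATT:              {att:.3f}")
print(f"selection bias:   {selection_bias:.3f}")
```

With these settings the naive difference comes out negative although every treated unit benefits, because the selection term is larger in magnitude than the effect.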

Omitted variables bias (Omitted Variables Bias, MHE Ch. 3): The OVB formula shows that omitting a relevant variable biases the coefficient on an included variable by the product of (a) the omitted variable’s effect on the outcome and (b) the coefficient from regressing the omitted variable on the included one. The bias therefore takes the sign of that product and vanishes only when either factor is zero.
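Written out for a single included regressor and a single omitted variable, the formula looks as follows. The symbols ($X_i$ for the included variable, $Z_i$ for the omitted one, and the coefficient names) are notation chosen here for illustration and may differ from the source note.

```latex
\begin{align*}
\text{long regression:}  \quad & Y_i = \alpha + \beta X_i + \gamma Z_i + e_i \\
\text{short regression:} \quad & Y_i = \tilde{\alpha} + \tilde{\beta} X_i + u_i \\
\text{OVB formula:}      \quad & \tilde{\beta} = \beta + \gamma\,\delta,
  \qquad \delta = \frac{\operatorname{Cov}(X_i, Z_i)}{\operatorname{Var}(X_i)}
\end{align*}
```

As a made-up numerical case, $\beta = 1$, $\gamma = 2$, and $\delta = 0.5$ give a short-regression coefficient of $1 + 2 \times 0.5 = 2$, twice the true effect.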

Example: Activity Bias in Advertising (Activity Bias in Advertising)

Lewis, Rao & Reiley (2011) showed that observational methods overestimate the causal effect of online ads by 160x compared to the RCT. The mechanism: users exposed to ads are inherently more active online, and this activity bias cannot be controlled away no matter how many covariates are included. The Conditional Independence Assumption is fundamentally violated.

Remedy: Draw causal DAGs before modeling. Use the potential outcomes framework to be explicit about what you’re estimating. When possible, use randomized experiments; otherwise, apply IV, DD, or RD with clear identification strategies.


2. The Garden of Forking Paths and Multiple Comparisons

Covered in detail in Q - Handling Multiple Comparisons When Selecting From Hundreds of Models.

The garden of forking paths (Gelman & Loken 2013) describes how the many decision points in data analysis — which variables to include, how to transform them, which subgroups to examine — create an enormous space of possible analyses. Even without intent to deceive, the p-value from the analysis you chose does not account for the analyses you would have chosen with different data (Researcher Degrees of Freedom).

With just 5 binary analytic choices, there are $2^5 = 32$ possible analysis paths. The probability that at least one yields $p < 0.05$ is far higher than 5%.
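The arithmetic can be sketched in a few lines. The calculation assumes a true null and treats the analysis paths as independent tests. That assumption is a simplification: paths run on the same data are correlated, so the real probability is lower than this figure, though still well above the nominal 5%.

```python
n_choices = 5          # binary analytic decisions
alpha = 0.05           # nominal significance threshold

n_paths = 2 ** n_choices
# Chance that at least one path reaches p < alpha under a true null,
# assuming (unrealistically) that the paths are independent tests.
p_any = 1 - (1 - alpha) ** n_paths

print(n_paths, round(p_any, 3))   # prints: 32 0.806
```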

Remedy: Use multilevel models with partial pooling, regularizing priors, and projection predictive variable selection. See the linked Q&A for the full progression of solutions.


3. Overfitting and Underfitting

Overfitting and Information Criteria (Statistical Rethinking Ch. 9) uses the metaphors of Scylla (overfitting) and Charybdis (underfitting):

  • $R^2$ never decreases as parameters are added; even random predictors improve in-sample fit
  • A 5th-degree polynomial achieves $R^2 = 1$ on 6 data points but predicts negative brain volumes between them
  • More complex models learn noise; simpler models miss real patterns
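The perfect-fit behaviour can be sketched without any fitting library. A degree-5 polynomial through six points with distinct x values is the unique interpolating polynomial, so it can be evaluated with Lagrange's formula. The six data values below are hypothetical positive measurements chosen for illustration, not the brain-volume data from the book.

```python
def lagrange(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through n points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical data: six strictly positive measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 1.0, 5.0, 2.0, 4.5, 3.0]

# In-sample fit: the polynomial passes through every point.
fitted = [lagrange(xs, ys, x) for x in xs]
mean_y = sum(ys) / len(ys)
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

# Between the points: scan a fine grid for the lowest prediction.
grid = [1 + k * 0.05 for k in range(101)]
lowest = min(lagrange(xs, ys, x) for x in grid)

print(f"R^2 = {r_squared:.6f}, lowest prediction on grid = {lowest:.2f}")
```

Although every observed value is positive, the interpolating curve dips below zero between observations, the same kind of absurd prediction the brain-volume example describes.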

Two families of solutions:

  1. Regularizing priors: informative priors that are skeptical of extreme parameter values, the Bayesian counterpart of ridge/lasso regression (Bayesian Linear Regression)
  2. Information criteria: score models on estimated out-of-sample predictive accuracy, for example with WAIC or PSIS-LOO

Information Budget (Choosing and Building Models)

As models grow more complex, priors need to become tighter. Even independent priors on individual logistic regression coefficients become strongly informative in the joint prior as the number of predictors increases — pushing predicted probabilities toward 0 or 1.
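A rough prior predictive sketch of this effect: draw coefficients from independent Normal(0, 1) priors, combine them with a hypothetical observation whose standardized predictors are also drawn at random, and record how often the implied probability is extreme. The prior scale, the 0.05/0.95 cut-offs, and the function name `share_extreme` are assumptions made for illustration.

```python
import math
import random

def share_extreme(n_predictors, n_draws=4000, prior_sd=1.0, seed=0):
    """Share of prior predictive probabilities below 0.05 or above 0.95."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_draws):
        # One prior draw per coefficient times one standardized predictor value each.
        eta = sum(rng.gauss(0, prior_sd) * rng.gauss(0, 1)
                  for _ in range(n_predictors))
        p = 1 / (1 + math.exp(-eta))       # inverse-logit
        if p < 0.05 or p > 0.95:
            extreme += 1
    return extreme / n_draws

for k in (2, 10, 50):
    print(k, share_extreme(k))
```

The share of extreme probabilities should rise with the number of predictors even though each individual prior is unchanged. That is the sense in which the joint prior becomes informative.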


4. Ignoring Missing Data Mechanisms

Missing Data - Statistical Rethinking emphasizes that the strategy for handling missing data depends on the causal mechanism generating the missingness:

Mechanism | Implication
MCAR | Complete-case analysis unbiased (but rare in practice)
MAR | Imputation valid if you condition on the right observables
MNAR | Must model the missingness mechanism explicitly

Always draw a causal DAG with the missingness indicator as a node. Conditioning on that indicator (that is, analyzing complete cases only) opens collider paths that can induce bias even under MAR.

Remedy: Treat missing values as latent variables with priors. Bayesian imputation naturally propagates uncertainty in imputed values through to parameter posteriors. Use pm.Normal.dist() in PyMC or similar approaches.
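To see why the mechanism matters, the sketch below compares complete-case means under two assumed mechanisms: values missing with a constant probability (MCAR) and values whose chance of being missing rises with the value itself (MNAR). The 40% rate and the logistic missingness rule are invented for illustration.

```python
import math
import random

rng = random.Random(2)
n = 20000
y = [rng.gauss(0, 1) for _ in range(n)]     # full data, true mean 0

def mean(values):
    return sum(values) / len(values)

# MCAR: every value has the same 40% chance of being missing.
observed_mcar = [v for v in y if rng.random() > 0.4]

# MNAR: larger values are more likely to be missing (depends on y itself).
observed_mnar = [v for v in y if rng.random() > 1 / (1 + math.exp(-2 * v))]

print(f"full-data mean:          {mean(y):.3f}")
print(f"complete cases under MCAR: {mean(observed_mcar):.3f}")
print(f"complete cases under MNAR: {mean(observed_mnar):.3f}")
```

The MCAR complete-case mean should stay close to the full-data mean, while the MNAR one is pulled clearly downward. No amount of extra data fixes the second case without a model of the missingness.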


5. Using Statistical Tests as “Golems”

Statistical Rethinking - The Golem of Prague (Statistical Rethinking Ch. 1) argues that statistical tests (t-tests, chi-squared, ANOVA) are pre-fabricated golems — powerful within their domain but dangerous when misapplied. Three key insights:

  1. Hypotheses are not models — the mapping between hypotheses, process models, and statistical models is many-to-many. Rejecting a null tells you very little.
  2. Falsification rarely works cleanly — observation error, continuous hypotheses, and the consensual nature of science all undermine naive falsification
  3. Build, don’t test — learn to construct, evaluate, and modify statistical models directly rather than choosing from a flowchart

Comparing Marginal Significance (Spurious Association and Confounds)

Even if $\beta_1$ is “significant” and $\beta_2$ is not, the difference $\beta_1 - \beta_2$ may not be significant. Always compute the contrast directly from the posterior.
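A small numerical sketch of this point. The two estimates and their standard errors are hypothetical. The posterior draws are generated from independent normal approximations, which is an assumption made to keep the example self-contained. With a fitted model you would subtract the actual posterior draws instead.

```python
import math
import random

# Hypothetical estimates and standard errors for two coefficients.
b1, se1 = 25.0, 10.0
b2, se2 = 10.0, 10.0

z1 = b1 / se1                                  # 2.5: conventionally "significant"
z2 = b2 / se2                                  # 1.0: not "significant"
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)       # assumes independent estimates
z_diff = (b1 - b2) / se_diff

# Draw-based version: compute the contrast per draw, then summarize it.
rng = random.Random(3)
contrast = sorted(rng.gauss(b1, se1) - rng.gauss(b2, se2) for _ in range(10000))
lower, upper = contrast[250], contrast[9749]   # approximate central 95% interval

print(f"z1 = {z1:.2f}, z2 = {z2:.2f}, z_diff = {z_diff:.2f}")
print(f"95% interval for the difference: ({lower:.1f}, {upper:.1f})")
```

One coefficient clears the usual threshold and the other does not, yet the interval for their difference comfortably includes zero.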


6. Neglecting Model Checking

Model Checking (BDA3 Ch. 6) and Evaluating Fitted Models (Bayesian Workflow Sec. 6) describe the iterative process of assessing model fit:

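Posterior predictive checking: simulate replicated data $y^{\mathrm{rep}}$ from the fitted model and compare to observed data:

$$p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta$$

If $y^{\mathrm{rep}}$ doesn’t “look like” the observed data, the model is missing something.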
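A minimal sketch of such a check, using a model simple enough that the posterior is available in closed form: independent Bernoulli trials with a uniform prior on the success probability, so the posterior is a Beta distribution. The binary sequence is hypothetical, and the number of switches between 0 and 1 is chosen as the test statistic because an independence model cannot reproduce long runs.

```python
import random

# Hypothetical binary sequence with long runs (strong positive autocorrelation).
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

def n_switches(seq):
    """Number of adjacent positions where the value changes."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

rng = random.Random(4)
n, s = len(y), sum(y)
t_obs = n_switches(y)

t_rep = []
for _ in range(5000):
    # Posterior draw of theta under a uniform Beta(1, 1) prior.
    theta = rng.betavariate(1 + s, 1 + n - s)
    # Replicated data set of the same length from the independence model.
    y_rep = [1 if rng.random() < theta else 0 for _ in range(n)]
    t_rep.append(n_switches(y_rep))

# Posterior predictive p-value: share of replications with at least as many switches.
p_value = sum(t >= t_obs for t in t_rep) / len(t_rep)
print(f"observed switches = {t_obs}, P(T_rep >= T_obs) = {p_value:.3f}")
```

A p-value near 0 or 1 signals misfit. Here nearly all replications switch more often than the observed sequence, pointing to the missing autocorrelation.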

Prior predictive checking (Choosing and Building Models): simulate data before observing real data to verify that your priors imply sensible data ranges. This catches unreasonable priors before they corrupt inference.

LOO-CV diagnostics (Evaluating Fitted Models): identify hard-to-predict observations (potential outliers or model misfit), check calibration via LOO-PIT, and assess observation influence.

Remedy: Model checking is iterative: identify misfit → expand the model → check again. Focus on severe tests, meaning checks likely to fail if the model would give misleading answers to the questions you care about.


7. Computational Problems Masking Model Problems

Computational Troubleshooting (Bayesian Workflow Sec. 5) introduces the folk theorem of statistical computing:

When you have computational problems, often there is a problem with your model, not the algorithm.

Divergent transitions, poor mixing, and slow convergence often signal a nonsensical model, not an algorithmic limitation. Common pathologies:

Problem | Likely Cause | Solution
Divergent transitions | Funnel geometry in hierarchical models | Non-centered parameterization
Slow mixing | Weakly informative data | Add reasonable prior information
Label switching | Symmetric modes in mixtures | Constrain to identify one mode
Multimodality | Substantively distinct explanations | Stacking; stronger priors
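As an illustration of the first row, the sketch below shows the identity that a non-centered parameterization relies on: a draw from Normal(mu, tau) has the same distribution as mu plus tau times a standard normal draw. It uses plain random sampling, not a Stan or PyMC model, and the values of `mu` and `tau` are hypothetical.

```python
import random
import statistics

rng = random.Random(5)
mu, tau = 1.0, 0.3      # hypothetical group-level mean and scale
n = 20000

# Centered: sample the group effect directly from Normal(mu, tau).
centered = [rng.gauss(mu, tau) for _ in range(n)]

# Non-centered: sample a standard normal, then shift and scale deterministically.
z = [rng.gauss(0, 1) for _ in range(n)]
non_centered = [mu + tau * zi for zi in z]

for name, draws in (("centered", centered), ("non-centered", non_centered)):
    print(f"{name:>12}: mean = {statistics.mean(draws):.3f}, "
          f"sd = {statistics.stdev(draws):.3f}")
```

The two versions describe the same prior. The practical difference is that in the non-centered form the sampler explores the standard normal variable, whose scale does not depend on tau, which is what removes the funnel.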

Debug strategy: Move from two directions — simplify the failing model top-down, and build up from a working simple model bottom-up. Where they meet is where the pathology lives.


8. Type S and Type M Errors from Underpowered Analyses

Type S and Type M Errors (Gelman et al. 2009, Sec. 3.1) reframes what goes wrong beyond the classical Type 1/Type 2 dichotomy:

  • Type S (sign) error: claiming an effect is positive when it’s actually negative
  • Type M (magnitude) error: the estimated magnitude is far from the true one, so the exaggeration ratio (estimate relative to true effect) is far from 1

Large Standard Errors Inflate Type M Errors

When the true effect is near zero, an estimator with a large standard deviation is more likely to produce large point estimates. Statistically significant results from underpowered studies are likely exaggerated — large estimates are a byproduct of large standard errors, not evidence of large effects.

This is the mechanism behind the “winner’s curse”: the most impressive result in a set of comparisons is likely the most exaggerated (Power Analysis and Sample Size).
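The mechanism can be checked by simulation. The sketch below assumes a small true effect and a standard error twice its size (both numbers are hypothetical), repeats the study many times, and looks only at the estimates that reach conventional significance.

```python
import random

rng = random.Random(6)
true_effect = 0.5       # hypothetical small true effect
se = 1.0                # hypothetical large standard error
n_sims = 100000

# Keep only the estimates that would be declared significant at the 5% level.
significant = []
for _ in range(n_sims):
    estimate = rng.gauss(true_effect, se)
    if abs(estimate) > 1.96 * se:
        significant.append(estimate)

power = len(significant) / n_sims
# Type S: share of significant estimates with the wrong sign.
type_s = sum(e < 0 for e in significant) / len(significant)
# Type M: average magnitude of significant estimates relative to the truth.
exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect

print(f"power = {power:.3f}, Type S rate = {type_s:.3f}, "
      f"exaggeration ratio = {exaggeration:.1f}")
```

Under these assumptions power is low, a noticeable share of significant results point the wrong way, and the significant estimates overstate the true effect several times over.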

Remedy: Frame questions in terms of sign and magnitude, not “is it zero?” Use multilevel models that naturally control Type S/M errors through shrinkage. Be suspicious of large effects from small samples.

Practical Implications

A checklist for avoiding these pitfalls:

  1. Before modeling: Draw a causal DAG. Include missingness mechanisms. Identify the target estimand.
  2. Building the model: Start simple, build modularly (Choosing and Building Models). Use regularizing priors. Perform prior predictive checks.
  3. Fitting: Monitor convergence diagnostics. If computation fails, suspect the model first (Computational Troubleshooting).
  4. Evaluating: Posterior predictive checks. LOO-CV with PSIS diagnostics. Check calibration.
  5. Comparing models: Use PSIS-LOO or WAIC, not $R^2$ or p-values. Consider stacking rather than selecting one model (Iterative Model Improvement).
  6. Reporting: Full posterior intervals, not binary significance. Compute contrasts directly. Acknowledge uncertainty.

Source Notes

Note | Relevance
Statistical Rethinking - The Golem of Prague | Philosophy: models as golems, build don’t test
Spurious Association and Confounds | Confounding, post-treatment bias, masked relationships
Omitted Variables Bias | OVB formula and bias direction
The Selection Problem | Potential outcomes framework, selection bias decomposition
Activity Bias in Advertising | Dramatic real-world example of confounding (160x overestimate)
Garden of Forking Paths | Data-contingent analysis invalidates p-values
Researcher Degrees of Freedom | Sources of analytic flexibility
Type S and Type M Errors | Sign and magnitude errors from underpowered studies
Multiple Testing Corrections | Classical corrections: Bonferroni, FDR, Holm
Partial Pooling as Multiple Comparisons Correction | Hierarchical models as adaptive correction
Overfitting and Information Criteria | Bias-variance tradeoff, information criteria, regularization
Model Checking | Posterior predictive checks (BDA3 Ch. 6)
Evaluating Fitted Models | Workflow for model evaluation (Bayesian Workflow Sec. 6)
Choosing and Building Models | Modular model construction, prior predictive checks
Computational Troubleshooting | Folk theorem, debugging strategies
Missing Data - Statistical Rethinking | Missingness mechanisms, Bayesian imputation
StatRethink-Bayes.pdf | Statistical Rethinking (McElreath)
BDA3.pdf | Bayesian Data Analysis 3rd ed. (Gelman et al.)
BayesWorkflow.pdf | Bayesian Workflow (Gelman et al. 2020)
Mostly Harmless Econometrics.pdf | Mostly Harmless Econometrics (Angrist & Pischke)
p_hacking.pdf | Garden of Forking Paths (Gelman & Loken 2013)
multiple2f.pdf | Multiple Comparisons (Gelman, Hill & Yajima 2009)
ssrn-2080235.pdf | Activity Bias (Lewis, Rao & Reiley 2011)

Gaps

  • No dedicated note on Simpson’s paradox — mentioned implicitly via confounding but no formal treatment with examples
  • No note on p-value misinterpretation specifically (e.g., common misreadings of what $p < 0.05$ means)
  • No coverage of data leakage in train/test splits or cross-validation
  • No coverage of survivorship bias as a specific selection mechanism
  • Limited treatment of multicollinearity — mentioned briefly in Spurious Association and Confounds but no dedicated note with VIF diagnostics or remedies
  • No coverage of ecological fallacy (group-level associations not holding at individual level)

Follow-Up Questions

  • How should I structure a causal DAG for my specific modeling problem?
  • What prior predictive checks should I run before fitting a Bayesian model?
  • How do I diagnose and fix divergent transitions in Stan/PyMC?
  • What is the right missing data strategy when the mechanism is MNAR?
  • How does regularization interact with causal identification (e.g., can shrinkage introduce bias in treatment effect estimates)?