What are some differences between frequentist and Bayesian statistics?

Summary

The core divide is philosophical: frequentists treat probability as long-run frequency and parameters as fixed unknowns to be estimated, while Bayesians treat probability as a degree of belief and parameters as random variables with distributions. This leads to practical differences in how uncertainty is quantified (confidence intervals vs. credible intervals), how prior knowledge enters the analysis, how multiple parameters are handled, and how models are compared. The two frameworks converge asymptotically but diverge sharply in small-sample, high-dimensional, or hierarchical settings.

Answer

1. What Is Probability?

This is the foundational divide (Probability and Bayesian Inference, BDA3 Ch. 1):

| Framework | Probability means… | Parameters are… |
|---|---|---|
| Frequentist | Long-run frequency of events in repeated experiments | Fixed but unknown constants |
| Bayesian | Degree of belief (uncertainty) in any proposition | Random variables with distributions |

In the frequentist view, asking “what is the probability that $\theta > 0$?” is meaningless: $\theta$ is fixed, so the probability is either 0 or 1. In the Bayesian view, $P(\theta > 0 \mid y)$ is a perfectly sensible statement about our uncertainty given data (Probability and Bayesian Inference).

Statistical Rethinking - The Golem of Prague frames this as: frequentists treat “randomness” as a property of the world (coins are either fair or not); Bayesians treat randomness as a property of information (we’re uncertain about whether the coin is fair).


2. The Core Mechanics

Frequentist: fit a model by maximizing the likelihood $p(y \mid \theta)$; report a point estimate $\hat\theta$ and standard errors derived from the sampling distribution of $\hat\theta$.

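Bayesian: combine the prior $p(\theta)$ with the likelihood $p(y \mid \theta)$ to obtain the posterior $p(\theta \mid y)$ via Bayes’ theorem (Probability and Bayesian Inference, BDA3 Ch. 1):

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta)$$

The posterior is the complete Bayesian answer: not a point estimate but a full distribution over plausible values of $\theta$.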
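A minimal sketch of Bayes’ theorem in action, using grid approximation for a coin’s success probability. The data (6 successes in 9 trials), the flat prior, and the function name `grid_posterior` are illustrative assumptions, not taken from the source notes.

```python
import math

def grid_posterior(successes, trials, grid_size=101):
    """Grid-approximate the posterior of a success probability (flat prior assumed)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat prior: an assumption for illustration
    likelihood = [
        math.comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)
        for p in grid
    ]
    # Bayes' theorem: posterior is proportional to likelihood times prior
    unnormalized = [lik * pr for lik, pr in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # plays the role of p(y) on the grid
    posterior = [u / evidence for u in unnormalized]
    return grid, posterior

grid, posterior = grid_posterior(successes=6, trials=9)
posterior_mode = grid[posterior.index(max(posterior))]
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print(f"mode ~ {posterior_mode:.2f}, mean ~ {posterior_mean:.3f}")
```

The output of the procedure is the whole list `posterior`, one weight per grid value; the mode and mean are optional summaries of it.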


3. Uncertainty Quantification: Confidence vs. Credible Intervals

This is the most practically important difference (Asymptotics and Frequentist Connections, BDA3 Ch. 4):

Frequentist 95% Confidence Interval

An interval constructed so that, if the experiment were repeated many times, 95% of such intervals would contain the true $\theta$. This says nothing about the probability that $\theta$ lies in any particular interval: it is a property of the procedure, not of the realized interval.

Bayesian 95% Credible Interval (HPDI or PI)

An interval $[a, b]$ such that $P(a \le \theta \le b \mid y) = 0.95$. This is a direct probability statement: given the observed data, there is a 95% probability that $\theta$ lies in this interval (Posterior Sampling and Summarization).

In practice, many researchers read a confidence interval as if it were a credible interval, which the frequentist construction does not license. The Bayesian credible interval answers the question users usually have in mind.
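The “property of the procedure” reading of a confidence interval can be checked by simulation. The sketch below assumes a Normal model with known standard deviation and a made-up true mean; the function name and all numeric settings are illustrative.

```python
import random
from statistics import NormalDist, mean

def coverage_of_z_interval(true_mean=1.0, sigma=2.0, n=25, reps=2000, seed=0):
    """Fraction of repeated 95% z-intervals that contain the fixed true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    half_width = z * sigma / n**0.5  # known sigma is assumed
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        center = mean(sample)
        # Each replication yields a different interval; the parameter never moves.
        hits += center - half_width <= true_mean <= center + half_width
    return hits / reps

coverage = coverage_of_z_interval()
print(f"empirical coverage: {coverage:.3f}")
```

Under a flat prior and known variance, the 95% credible interval for the mean has the same endpoints as this z-interval; what differs is the statement attached to it.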

They Converge Asymptotically (Asymptotics and Frequentist Connections)

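As $n \to \infty$, the posterior concentrates around the MLE and the data dominate the prior. Under regularity conditions (Bernstein-von Mises theorem): $p(\theta \mid y) \approx \mathrm{N}\big(\hat\theta, [n I(\hat\theta)]^{-1}\big)$, where $I$ denotes the Fisher information per observation. Bayesian credible intervals and frequentist confidence intervals then coincide in large samples. They diverge most in small-$n$, high-dimensional, or hierarchical settings.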
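As a rough numerical illustration of this convergence, the sketch below compares a Beta posterior (flat Beta(1, 1) prior, binomial data) with the MLE and its standard error at two sample sizes. The observed rate of 0.3 and the function name are assumptions chosen for illustration.

```python
def beta_posterior_vs_mle(n, rate=0.3):
    """Return (posterior mean minus MLE, posterior sd divided by MLE standard error)."""
    successes = round(rate * n)
    a, b = 1 + successes, 1 + n - successes  # Beta(1, 1) prior updated by the data
    post_mean = a / (a + b)
    post_sd = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
    mle = successes / n
    mle_se = (mle * (1 - mle) / n) ** 0.5
    return post_mean - mle, post_sd / mle_se

gap_small, ratio_small = beta_posterior_vs_mle(10)
gap_large, ratio_large = beta_posterior_vs_mle(10_000)
print(f"n=10:    mean gap {gap_small:.4f}, sd ratio {ratio_small:.3f}")
print(f"n=10000: mean gap {gap_large:.6f}, sd ratio {ratio_large:.4f}")
```

At the larger sample size the gap between posterior mean and MLE is negligible and the two spread measures nearly match, which is the practical content of the asymptotic result.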


4. Prior Information

The biggest practical distinction in finite samples.

Frequentist methods have no formal mechanism for incorporating prior knowledge (though regularization methods like ridge and lasso are implicitly Bayesian). Bayesian methods explicitly require a prior (Single-Parameter Models, BDA3 Ch. 2):

| Prior type | Description | When to use |
|---|---|---|
| Informative | Encodes genuine domain knowledge (e.g., cancer rates from neighboring counties) | Strong prior evidence available |
| Weakly informative | Constrains $\theta$ to reasonable ranges without dominating the likelihood | Default for most applied work |
| Noninformative / flat | “Let the data speak”: uniform prior | Large $n$; recovers frequentist MLE |
| Jeffreys’ | $p(\theta) \propto \sqrt{\det I(\theta)}$ prior; invariant to reparameterization | Reference analysis |

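The posterior mean in the Normal-Normal model (prior $\theta \sim \mathrm{N}(\mu_0, \tau_0^2)$, data $y_i \sim \mathrm{N}(\theta, \sigma^2)$ with known $\sigma$) illustrates the prior-data compromise:

$$\mu_n = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}$$

This is a precision-weighted average of the prior mean and the sample mean, with the prior’s influence shrinking as $n$ grows (Gelman et al., BDA3 Ch. 2).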
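The precision-weighted average can be written in a few lines. This is a sketch of the standard conjugate Normal-Normal update with known data standard deviation; the function name and the example numbers are assumptions.

```python
def normal_normal_posterior(prior_mean, prior_sd, sample_mean, sigma, n):
    """Posterior mean and sd for a Normal mean with known sigma and a Normal prior."""
    prior_precision = 1 / prior_sd**2
    data_precision = n / sigma**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * sample_mean) / post_precision
    return post_mean, post_precision**-0.5

# Same prior and sample mean, different amounts of data.
mean_n1, sd_n1 = normal_normal_posterior(0.0, 1.0, sample_mean=2.0, sigma=1.0, n=1)
mean_n100, sd_n100 = normal_normal_posterior(0.0, 1.0, sample_mean=2.0, sigma=1.0, n=100)
print(mean_n1, mean_n100)
```

With one observation the posterior mean sits halfway between prior mean and sample mean; with a hundred it is almost entirely determined by the data.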

Regularizing Priors as Implicit Frequentist Corrections (Bayesian Linear Regression)

Ridge regression ($L_2$ penalty) corresponds to the posterior mode under a Normal prior on the coefficients. Lasso ($L_1$ penalty) corresponds to the posterior mode under a Laplace prior. The horseshoe prior is a state-of-the-art Bayesian regularizer that allows large signals while strongly shrinking noise, and it has no natural frequentist counterpart. Using priors is not a weakness; it is often the only principled way to handle $p > n$ settings.
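The ridge correspondence can be verified directly in the simplest case: one predictor, no intercept, known noise standard deviation. The sketch below assumes made-up data and hypothetical function names; the ridge penalty that matches a Normal(0, tau^2) prior is sigma^2 / tau^2.

```python
def ridge_slope(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for a single coefficient b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def map_slope_normal_prior(x, y, sigma, tau):
    """Posterior mode (also the mean) of b under y ~ N(b*x, sigma^2), b ~ N(0, tau^2)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    posterior_precision = sxx / sigma**2 + 1 / tau**2
    return (sxy / sigma**2) / posterior_precision

x = [1.0, 2.0, 3.0, 4.0]  # illustrative data
y = [1.1, 1.9, 3.2, 3.9]
sigma, tau = 1.0, 2.0
b_ridge = ridge_slope(x, y, lam=sigma**2 / tau**2)
b_map = map_slope_normal_prior(x, y, sigma, tau)
print(b_ridge, b_map)
```

A tighter prior (smaller `tau`) maps to a larger penalty, which is why the choice of penalty strength and the choice of prior scale are the same decision in this setting.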


5. What You Report: Point Estimates vs. Full Posteriors

Frequentist: report $\hat\theta$ and a standard error (or confidence interval). Uncertainty is summarized by the sampling distribution of $\hat\theta$ over hypothetical repeated experiments.

Bayesian: report the full posterior $p(\theta \mid y)$. Point summaries are optional (Posterior Sampling and Summarization, Statistical Rethinking Ch. 3):

  • Median minimizes expected absolute loss
  • Mean minimizes expected quadratic loss
  • Mode (MAP) minimizes zero-one loss
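The link between loss functions and point summaries can be checked numerically on a set of draws. In the sketch below, Gamma-distributed random numbers stand in for skewed posterior draws (an assumption for illustration), and a brute-force search over candidate values finds the minimizers of expected absolute and squared loss.

```python
import random
from statistics import mean, median

rng = random.Random(1)
# Stand-in for skewed posterior draws; an odd count gives a unique median.
draws = [rng.gammavariate(2.0, 1.0) for _ in range(1001)]

def expected_loss(point, samples, loss):
    """Average loss of reporting `point` when the truth is distributed as `samples`."""
    return sum(loss(point, s) for s in samples) / len(samples)

candidates = [i / 100 for i in range(601)]  # grid of candidate summaries on [0, 6]
best_abs = min(candidates, key=lambda c: expected_loss(c, draws, lambda a, b: abs(a - b)))
best_sq = min(candidates, key=lambda c: expected_loss(c, draws, lambda a, b: (a - b) ** 2))

print(f"absolute loss minimizer {best_abs:.2f} vs median {median(draws):.2f}")
print(f"squared loss minimizer  {best_sq:.2f} vs mean   {mean(draws):.2f}")
```

Because the draws are skewed, the two minimizers differ noticeably, which is one way to see why a single number cannot stand in for the whole distribution.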

McElreath’s insight: “you rarely need a point estimate. The entire posterior distribution is the Bayesian answer.” This matters most when the posterior is skewed or multimodal — in those cases, any single-number summary is misleading.

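The posterior predictive distribution propagates full uncertainty into predictions for a new observation $\tilde{y}$:

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$$

Outside simple models with exact pivots, frequentist prediction intervals often account for parameter uncertainty only approximately (via plug-in estimates or the delta method).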
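A simulation sketch of the posterior predictive distribution, assuming binomial data (6 successes in 9 trials), a flat Beta(1, 1) prior, and a prediction for 10 new trials; all of these numbers are illustrative. It compares the predictive variance with a plug-in variance that treats the estimated rate as known.

```python
import random
from statistics import pvariance

rng = random.Random(2)
a, b = 1 + 6, 1 + 3  # Beta posterior after 6 successes, 3 failures, flat prior
new_trials = 10

predictive = []
for _ in range(4000):
    p = rng.betavariate(a, b)  # step 1: draw a parameter value from the posterior
    # step 2: simulate new data given that parameter value
    predictive.append(sum(rng.random() < p for _ in range(new_trials)))

predictive_var = pvariance(predictive)
plug_in_p = 6 / 9
plug_in_var = new_trials * plug_in_p * (1 - plug_in_p)  # ignores uncertainty in p
print(f"predictive variance {predictive_var:.2f} vs plug-in {plug_in_var:.2f}")
```

The predictive draws are more spread out than the plug-in binomial because they average over plausible values of the rate instead of fixing it at the estimate.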


6. Handling Multiple Parameters and Nuisance Parameters

Frequentist: nuisance parameters are profiled out (maximize over them), or eliminated via conditioning or sufficiency. The sampling distribution of profile likelihood estimators may be complex.

Bayesian: marginalize over nuisance parameters by integration (Multiparameter Models, BDA3 Ch. 3). With parameter of interest $\theta_1$ and nuisance parameter $\theta_2$:

$$p(\theta_1 \mid y) = \int p(\theta_1, \theta_2 \mid y)\, d\theta_2$$

This is conceptually clean but computationally demanding, which motivates MCMC and variational methods. The practical payoff: uncertainty about nuisance parameters flows into uncertainty about parameters of interest automatically. With simulation, this reduces to examining marginals of the joint posterior draws.
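Marginalizing by simulation can be sketched for a Normal model with unknown mean and variance. The code assumes the standard noninformative prior for this model, under which the variance has a scaled inverse chi-square posterior and the mean is Normal given the variance; the data values are made up.

```python
import random
from statistics import mean, pstdev, variance

rng = random.Random(4)
y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]  # illustrative data
n, ybar, s2 = len(y), mean(y), variance(y)

mu_draws = []
for _ in range(5000):
    # sigma^2 | y: (n - 1) * s^2 divided by a chi-square(n - 1) draw
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)
    sigma2 = (n - 1) * s2 / chi2
    # mu | sigma^2, y: Normal around the sample mean
    mu_draws.append(rng.gauss(ybar, (sigma2 / n) ** 0.5))

# Keeping only the mu draws IS the marginalization over sigma^2.
marginal_sd = pstdev(mu_draws)
plug_in_sd = (s2 / n) ** 0.5  # spread if sigma^2 were fixed at its estimate
print(f"marginal sd {marginal_sd:.3f} vs plug-in sd {plug_in_sd:.3f}")
```

No integral is evaluated explicitly: ignoring the nuisance column of the joint draws gives the marginal, and its extra spread reflects uncertainty about the variance.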


7. Hierarchical / Multilevel Settings

This is where Bayesian methods most clearly dominate frequentist alternatives (Hierarchical Models, BDA3 Ch. 5).

Frequentist: fixed effects (no pooling), a single pooled estimate (complete pooling), or random-effects models that partially pool using plug-in variance estimates. The frequentist random-effects estimator requires approximations that become unreliable with few groups.

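Bayesian: partial pooling arises naturally from the hierarchical model:

$$y_j \mid \theta_j \sim \mathrm{N}(\theta_j, \sigma_j^2), \qquad \theta_j \mid \mu, \tau \sim \mathrm{N}(\mu, \tau^2), \qquad j = 1, \dots, J$$

The posterior for each $\theta_j$ borrows strength from all groups. The degree of pooling is determined by the variance ratio $\sigma_j^2 / \tau^2$, which is inferred from the data, not pre-specified (Partial Pooling as Multiple Comparisons Correction).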
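The shrinkage behind partial pooling can be sketched for a Normal hierarchical model, conditional on the population mean and between-group standard deviation. In a full Bayesian analysis those two quantities would have their own posterior and be averaged over; fixing them here, along with the made-up group data and the function name, is a simplifying assumption.

```python
def partial_pooling_means(group_means, group_ses, mu, tau):
    """Conditional posterior means of group effects given mu and tau."""
    pooled = []
    for ybar, se in zip(group_means, group_ses):
        # Weight on the group's own data: large when the group is precisely measured.
        weight = (1 / se**2) / (1 / se**2 + 1 / tau**2)
        pooled.append(weight * ybar + (1 - weight) * mu)
    return pooled

group_means = [12.0, 3.0, -4.0]  # illustrative raw group estimates
group_ses = [2.0, 5.0, 10.0]     # illustrative standard errors
mu, tau = 4.0, 3.0               # fixed here only for illustration
estimates = partial_pooling_means(group_means, group_ses, mu, tau)
for raw, se, est in zip(group_means, group_ses, estimates):
    print(f"raw {raw:6.1f} (se {se:4.1f}) -> partially pooled {est:6.2f}")
```

The noisiest group is pulled almost all the way to the population mean while the precisely measured group barely moves, which is the sense in which estimates are adjusted rather than thresholds.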

Multiple Comparisons - Bayesian Perspective shows this also handles multiple comparisons: classical corrections adjust thresholds post-hoc; Bayesian partial pooling adjusts the estimates themselves, reducing Type S and Type M errors simultaneously.


8. Model Comparison

Frequentist: likelihood ratio tests, AIC, adjusted $R^2$, $F$-tests. These penalize model complexity by the raw number of parameters $k$.

Bayesian: WAIC, LOO-CV (PSIS-LOO), and Bayes factors (Model Comparison, Overfitting and Information Criteria):

  • WAIC uses the effective number of parameters $p_{\mathrm{WAIC}}$ (which differs from $k$ for hierarchical models with partial pooling)
  • PSIS-LOO approximates leave-one-out cross-validation with diagnostics for reliability
  • Bayes factors integrate over all parameters, penalizing complexity automatically — but are sensitive to priors on parameters

AIC and WAIC Converge (Overfitting and Information Criteria)

WAIC converges to AIC when priors are flat and the posterior is Gaussian. The Bayesian WAIC is more general: it uses the full posterior and makes no Gaussian approximation, giving correct effective parameter counts for hierarchical models.
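A rough sketch of how WAIC and its effective parameter count are computed from posterior draws, using the common definition (log pointwise predictive density minus the summed posterior variance of the pointwise log-likelihood). The model is a Normal mean with known standard deviation and a flat prior, and the data are simulated; all of that is assumed for illustration. With one free parameter, the effective count should land near 1.

```python
import math
import random
from statistics import mean, variance

rng = random.Random(3)
sigma = 1.0
y = [rng.gauss(0.5, sigma) for _ in range(50)]  # simulated data
n, ybar = len(y), mean(y)
# Posterior for the mean under a flat prior and known sigma
theta_draws = [rng.gauss(ybar, sigma / math.sqrt(n)) for _ in range(2000)]

def log_normal_pdf(value, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd**2) - (value - mu) ** 2 / (2 * sd**2)

lppd = 0.0
p_waic = 0.0
for yi in y:
    log_liks = [log_normal_pdf(yi, t, sigma) for t in theta_draws]
    top = max(log_liks)  # log-sum-exp trick for numerical stability
    lppd += top + math.log(sum(math.exp(v - top) for v in log_liks) / len(log_liks))
    p_waic += variance(log_liks)  # posterior variance of this point's log-likelihood

waic = -2 * (lppd - p_waic)
print(f"effective parameters ~ {p_waic:.2f}, WAIC ~ {waic:.1f}")
```

In applied work this computation is usually delegated to a library; the sketch only shows what the quantities are made of.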


9. Significance Testing and Multiple Comparisons

Frequentist: hypothesis tests produce binary decisions (reject / fail to reject) based on the p-value $p = P(T(y^{\mathrm{rep}}) \ge T(y) \mid H_0)$. The p-value is a probability statement about data under the null hypothesis, not about $\theta$ (Forking Paths and Bayesian Approaches).

Bayesian: no null hypothesis rejection. Instead:

  • Report the posterior probability that the effect is positive: $P(\theta > 0 \mid y)$
  • Report the full posterior distribution over effect size
  • Use posterior predictive checks to assess model fit (Model Checking)
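To make the contrast concrete, the sketch below computes a one-sided p-value and the posterior probability of a positive effect for the same data summary. It assumes a Normal mean with known standard deviation and a flat prior, a special case in which the two numbers sum to one even though they answer different questions; the function names and inputs are illustrative.

```python
from statistics import NormalDist

def one_sided_p_value(sample_mean, sigma, n):
    """P(sample mean at least this large | true effect = 0), under repeated sampling."""
    z = sample_mean / (sigma / n**0.5)
    return 1 - NormalDist().cdf(z)

def posterior_prob_positive(sample_mean, sigma, n):
    """P(effect > 0 | data) under a flat prior: posterior is N(sample_mean, sigma^2 / n)."""
    posterior = NormalDist(mu=sample_mean, sigma=sigma / n**0.5)
    return 1 - posterior.cdf(0.0)

p_value = one_sided_p_value(sample_mean=0.4, sigma=1.0, n=25)
prob_positive = posterior_prob_positive(sample_mean=0.4, sigma=1.0, n=25)
print(f"p-value {p_value:.4f}, P(effect > 0 | y) {prob_positive:.4f}")
```

The numerical agreement is specific to the flat prior; an informative or hierarchical prior would move the posterior probability while leaving the p-value unchanged.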

Forking Paths and Bayesian Approaches emphasizes why this matters: the p-value’s validity depends on the sampling distribution of the test statistic under repeated use of the same procedure. When the procedure is data-contingent (as in model search), the p-value is invalid. A Bayesian posterior probability remains a coherent statement regardless of how the analysis was chosen.


Practical Implications: When to Use Which

| Situation | Lean Frequentist | Lean Bayesian |
|---|---|---|
| Large $n$, simple model | ✓ (CLT holds, prior irrelevant) | Either works |
| Small $n$ | | ✓ (priors encode structure, uncertainty propagated correctly) |
| Many parameters relative to $n$ | | ✓ (regularizing priors essential) |
| Hierarchical/grouped data | Fragile | ✓ (partial pooling natural) |
| Prediction with uncertainty | Approximate | ✓ (posterior predictive) |
| Multiple comparisons | Corrections needed | ✓ (partial pooling handles it structurally) |
| Preregistered RCT | ✓ (classical inference valid) | Either works |
| Need interpretable probability statements | | ✓ (credible intervals are what users want) |
| Communicating to non-statisticians | Risky (CI misinterpretation) | ✓ (credible interval is intuitive) |

Source Notes

| Note | Relevance |
|---|---|
| Probability and Bayesian Inference | Core formula: Bayes’ theorem; three steps of Bayesian analysis |
| Asymptotics and Frequentist Connections | Bernstein-von Mises theorem; when the two frameworks converge |
| Single-Parameter Models | Posterior as prior-data compromise; conjugate priors |
| Posterior Sampling and Summarization | Credible intervals, HPDI, posterior predictive distribution |
| Multiparameter Models | Marginalizing over nuisance parameters |
| Bayesian Linear Regression | Regularizing priors; connection to ridge/lasso/horseshoe |
| Hierarchical Models | Partial pooling; where Bayesian methods dominate |
| Overfitting and Information Criteria | WAIC vs. AIC; regularizing priors vs. model selection |
| Statistical Rethinking - The Golem of Prague | Philosophy: probability as property of information |
| Forking Paths and Bayesian Approaches | Why p-values fail under data-contingent analysis |
| Multiple Comparisons - Bayesian Perspective | Hierarchical models replacing classical corrections |
| Regression and the CEF | Frequentist regression: best linear approximation to CEF |
| BDA3.pdf | BDA3 Chs. 1-5: the canonical Bayesian reference |
| StatRethink-Bayes.pdf | Statistical Rethinking: accessible Bayesian perspective |
| Mostly Harmless Econometrics.pdf | Frequentist regression from the econometrics perspective |
| multiple2f.pdf | Gelman et al. (2009): Bayesian superiority for multiple comparisons |
| p_hacking.pdf | Gelman & Loken (2013): frequentist p-values under model search |

Gaps

  • No dedicated note on the likelihood principle — the Bayesian-frequentist divide has deep roots in whether inference should respect the likelihood principle (Birnbaum 1962); the vault has no formal treatment of this
  • No coverage of fiducial inference or neo-Fisherian approaches that occupy middle ground
  • No treatment of objective Bayes (Jeffreys priors, reference priors) beyond a brief mention in Single-Parameter Models
  • No formal treatment of the Neyman-Pearson framework (Type 1/Type 2 errors, power) vs. Fisherian p-values — two distinct frequentist traditions that are often conflated
  • Limited coverage of empirical Bayes — mentioned in passing in Partial Pooling as Multiple Comparisons Correction but no dedicated note

Follow-Up Questions

  • When should I use informative priors, and how do I choose them?
  • What is the posterior predictive check workflow in PyMC or Stan?
  • How does MCMC actually work — what are the diagnostics I should monitor?
  • How do Bayesian and frequentist approaches compare for causal inference specifically?
  • What is empirical Bayes, and how does it relate to hierarchical models?