Type S and Type M Errors

Summary

Type S (sign) and Type M (magnitude) errors reframe statistical error beyond the classical Type 1/Type 2 dichotomy. A Type S error occurs when we get the direction of an effect wrong; a Type M error occurs when we substantially over- or under-estimate its size. These errors are more practically relevant than Type 1 errors in social science, where effects are rarely exactly zero, and they are exacerbated by underpowered studies.

Overview

The classical multiple comparisons framework focuses on Type 1 error: the probability of rejecting the null hypothesis $H_0$ when it is true. Gelman & Tuerlinckx (2000) argue this is the wrong concern for most applied research, because:

  1. We rarely believe the null hypothesis is exactly true — in social science, there is almost always some difference between groups
  2. The practical question is not “is the effect zero?” but “what is the effect’s sign and magnitude?”

Definitions

Definition: Type S Error (Gelman & Tuerlinckx, 2000)

A Type S (sign) error occurs when we claim an effect is in one direction (e.g., $\theta > 0$) when the true effect is in the opposite direction ($\theta < 0$). For comparisons between groups, this means claiming $\theta_j > \theta_k$ when in fact $\theta_j < \theta_k$.

Definition: Type M Error (Gelman & Tuerlinckx, 2000)

A Type M (magnitude) error occurs when the estimated effect size differs substantially from the true effect size. The exaggeration ratio is $|\hat{\theta}| / |\theta|$, the absolute estimate divided by the absolute true effect. A Type M error is particularly concerning when a treatment effect is reported as near zero when it is actually large, or reported as large when it is near zero.
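The two definitions can be made concrete with a small design calculation. The sketch below is not taken from the cited papers' code; it assumes the estimate is normally distributed around a known positive true effect with a known standard error, and that a claim is made only when a two-sided z-test is significant. The function name `design_analysis` and the example values are hypothetical. It reports the power, the probability of a wrong sign given significance (Type S), and the expected ratio of the absolute significant estimate to the true effect (a Type M summary).

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def design_analysis(theta, se, alpha=0.05):
    """Return (power, type_s_rate, exaggeration) for an estimate ~ N(theta, se^2).

    Assumes theta > 0 and a two-sided z-test at level alpha (illustrative setup).
    """
    z = STD.inv_cdf(1 - alpha / 2)      # critical value, about 1.96 for alpha = 0.05
    d = theta / se                      # true effect in standard-error units
    p_hi = 1 - STD.cdf(z - d)           # P(significant with the correct, positive sign)
    p_lo = STD.cdf(-z - d)              # P(significant with the wrong, negative sign)
    power = p_hi + p_lo
    type_s = p_lo / power               # wrong sign, conditional on significance
    # E[|estimate| restricted to the significant region], from truncated-normal means
    abs_mean = theta * (p_hi - p_lo) + se * (STD.pdf(z - d) + STD.pdf(z + d))
    exaggeration = abs_mean / power / theta
    return power, type_s, exaggeration


# Hypothetical example: a true effect of half a standard error
power, type_s, exaggeration = design_analysis(theta=0.5, se=1.0)
print(f"power={power:.3f}  type S={type_s:.3f}  exaggeration={exaggeration:.2f}")
```

As the true effect grows relative to the standard error, the power approaches 1, the Type S rate approaches 0, and the exaggeration ratio approaches 1.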

Why These Errors Matter More Than Type 1

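Consider site-specific treatment effects $\theta_j$ for $j = 1, \dots, J$ with hypotheses $H_0: \theta_j = 0$ vs. $H_A: \theta_j \neq 0$:

  • Type 1 error asks: do we reject $H_0$ when $\theta_j = 0$? But when do we truly believe $\theta_j$ is exactly zero?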
  • Type S error asks: do we claim the effect is positive when it’s actually negative? This has direct policy consequences.
  • Type M error asks: is our estimate wildly off from the truth? This determines whether our conclusions are actionable.

The Role of Uncertainty

Theorem: Large Standard Errors Inflate Type M Errors (Gelman, Hill & Yajima, 2009, Sec. 3.1)

When the true effect is near zero, an estimator with a large standard deviation is more likely to produce large point estimates (in absolute value) than an estimator with a small standard deviation. Statistically significant results from underpowered studies are therefore likely to be exaggerated — large estimates are a byproduct of large standard errors, not evidence of large effects.

This is illustrated by comparing two sampling distributions when the true effect is zero:

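  • With a small standard error: estimates beyond the significance threshold (roughly two standard errors from zero) occur 5% of the time, but their magnitude is modest
  • With a large standard error: significant estimates are necessarily larger in absolute value, giving the misleading impression of a large effect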
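The comparison can be checked numerically. The following is a minimal sketch, assuming the estimate is normal with mean zero and that "significant" means a two-sided z-test at the 5% level; the standard errors 1 and 3 are illustrative values, not figures from the source. Under these assumptions the average size of a significant estimate has a closed form (the inverse Mills ratio times the standard error).

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def mean_significant_magnitude(se, alpha=0.05):
    """E[|estimate| given |estimate| > z * se] when estimate ~ N(0, se^2)."""
    z = STD.inv_cdf(1 - alpha / 2)
    # Inverse Mills ratio: mean of a standard normal truncated to values above z
    return se * STD.pdf(z) / (1 - STD.cdf(z))


for se in (1.0, 3.0):  # illustrative standard errors
    print(f"se={se}: mean |significant estimate| = {mean_significant_magnitude(se):.2f}")
```

Both estimators are significant equally often, but the typical significant estimate scales in direct proportion to the standard error even though the true effect is zero in both cases.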

Example: Subgroup Analysis (IHDP, Sec. 4.5)

Setup: When moving from main effects to subgroup effects (e.g., site × birth-weight group), sample sizes per cell decrease and standard errors increase.

Result: Classical estimates become more volatile — the lighter low-birth-weight group shows large swings across sites. The Bonferroni correction only widens intervals further, reinforcing the unreliability.

Bayesian solution: The multilevel model shrinks the volatile estimates toward the group mean, yielding more stable and reliable subgroup-specific estimates. None of the Bayesian 95% intervals come close to covering zero, contrasting with 4 of the Bonferroni-adjusted classical intervals that include zero.

Connection to Multiple Comparisons

When we switch from examining main effects to subgroup effects or pairwise comparisons:

  • Sample sizes per comparison decrease
  • Standard errors increase
  • Type M errors become more likely — we are more likely to see large, misleading estimates
  • Type S errors become more likely — with wide sampling distributions, getting the sign wrong is more probable

Classical multiple comparisons corrections (Bonferroni, FDR) address Type 1 error by widening intervals or raising the significance threshold. But this further reduces power, potentially increasing Type S and Type M error rates for real effects.
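A rough sketch of this trade-off, under assumptions that are not in the source: a normal estimate with a known positive true effect, a two-sided z-test, and a Bonferroni split of the 5% level across m comparisons. The helper names and the values (true effect 2, standard error 1, m = 10) are hypothetical. It shows how the stricter threshold lowers power and raises the expected exaggeration among the results that remain significant.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def bonferroni_z(m, alpha=0.05):
    """Two-sided critical value when alpha is split evenly across m comparisons."""
    return STD.inv_cdf(1 - alpha / (2 * m))


def power_and_exaggeration(theta, se, z):
    """Power and expected exaggeration for estimate ~ N(theta, se^2), theta > 0."""
    d = theta / se
    p_hi = 1 - STD.cdf(z - d)   # significant, correct sign
    p_lo = STD.cdf(-z - d)      # significant, wrong sign
    power = p_hi + p_lo
    abs_mean = theta * (p_hi - p_lo) + se * (STD.pdf(z - d) + STD.pdf(z + d))
    return power, abs_mean / power / theta


for m in (1, 10):  # no correction vs. ten comparisons (illustrative)
    z = bonferroni_z(m)
    power, exaggeration = power_and_exaggeration(theta=2.0, se=1.0, z=z)
    print(f"m={m}: z={z:.3f}  power={power:.3f}  exaggeration={exaggeration:.2f}")
```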

Multilevel models address all three error types simultaneously:

  • Partial pooling reduces the rate of “significant” comparisons (like classical corrections)
  • But it does so by improving point estimates through shrinkage, not by inflating uncertainty
  • Type S error rates decrease because shrunk estimates are more stable
  • Type M error rates decrease because extreme estimates are pulled toward the grand mean
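The shrinkage mechanism can be sketched with the simplest normal-normal model. This is an illustration only, not the model fitted in the source: it assumes each group estimate is normal around its true effect with a known standard error, that true effects vary around a grand mean with a known between-group standard deviation `tau`, and that the grand mean is estimated by a precision-weighted average. The function name `partial_pool` and the data are made up for the example.

```python
def partial_pool(estimates, std_errors, tau):
    """Shrink group estimates toward a precision-weighted grand mean.

    Assumes estimate_j ~ N(theta_j, se_j^2) and theta_j ~ N(mu, tau^2),
    with tau treated as known (an illustrative simplification).
    """
    # Marginally, estimate_j ~ N(mu, se_j^2 + tau^2), which gives the weights for mu
    weights = [1.0 / (se ** 2 + tau ** 2) for se in std_errors]
    mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    pooled = []
    for y, se in zip(estimates, std_errors):
        shrink = se ** 2 / (se ** 2 + tau ** 2)  # weight placed on the grand mean
        pooled.append(shrink * mu + (1 - shrink) * y)
    return mu, pooled


# Made-up data: two noisy groups with extreme estimates, two precise groups
estimates = [12.0, -3.0, 5.0, 4.0]
std_errors = [6.0, 5.0, 1.0, 1.5]
mu, pooled = partial_pool(estimates, std_errors, tau=2.0)
for y, se, p in zip(estimates, std_errors, pooled):
    print(f"raw={y:6.2f}  se={se:4.1f}  pooled={p:6.2f}")
```

The noisy groups move most of the way toward the grand mean, while the precisely estimated groups barely change, which is why extreme estimates with large standard errors are the ones that get corrected.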

Practical Guidance

  1. Frame questions in terms of sign and magnitude, not “is it zero?”
  2. Use multilevel models when comparing many groups: partial pooling tends to reduce both Type S and Type M error rates
  3. Be suspicious of large estimates from small samples — they are likely Type M errors
  4. Don’t rely solely on statistical significance — a significant result from an underpowered study may have the wrong sign

See Also