The Golem of Prague

Summary

Chapter 1 of Richard McElreath's Statistical Rethinking argues that statistical models are like golems: powerful but mindless constructs that follow instructions literally. Instead of choosing among pre-made tests, scientists should learn to build and understand their own models. The chapter makes three key arguments: (1) hypotheses are not models, (2) falsification rarely works cleanly, and (3) scientists should build models rather than pick ready-made tests.

Statistical Golems

Statistical tests (t-tests, chi-squared, ANOVA, etc.) are pre-fabricated golems: powerful within their domain but dangerous when misapplied. Scientists often choose tests from a flowchart without understanding the underlying model. McElreath argues for golem engineering — learning to construct, evaluate, and modify statistical models directly.

Hypotheses Are Not Models

Hypotheses, process models, and statistical models do not map onto one another one-to-one. The mapping is many-to-many:

  1. Any statistical model (M) may correspond to more than one process model (P)
  2. Any hypothesis (H) may correspond to more than one process model (P)
  3. Any statistical model (M) may correspond to more than one hypothesis (H)

Rejecting a Null Tells You Little

Rejecting a null hypothesis does not confirm the alternative. The same statistical model that fits a “null” process can also fit a “selection” process, so evidence for or against that one model cannot tell the two apart. Comparing several non-null models against one another is far more informative.
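The many-to-many mapping can be sketched as a small lookup. This is only an illustration: the labels (H0_neutral, P0A, M_II, and so on) and the particular links between them are hypothetical, chosen to mirror the "null" versus "selection" example rather than taken from the chapter.

```python
# Hypothetical labels: two hypotheses, four process models, three statistical models.
HYPOTHESIS_TO_PROCESSES = {
    "H0_neutral": ["P0A", "P0B"],
    "H1_selection": ["P1A", "P1B"],
}

# Each process model implies one statistical model; M_II is implied by two of them.
PROCESS_TO_MODEL = {
    "P0A": "M_II",
    "P0B": "M_III",
    "P1A": "M_I",
    "P1B": "M_II",
}


def hypotheses_consistent_with(model):
    """Return every hypothesis with at least one process model implying `model`."""
    return sorted(
        hypothesis
        for hypothesis, processes in HYPOTHESIS_TO_PROCESSES.items()
        if any(PROCESS_TO_MODEL[p] == model for p in processes)
    )


print(hypotheses_consistent_with("M_II"))  # ['H0_neutral', 'H1_selection']
print(hypotheses_consistent_with("M_I"))   # ['H1_selection']
```

In this toy setup, evidence for M_II leaves both hypotheses standing, while evidence for M_I would single out one of them. That is why comparing models on which the processes differ is more useful than rejecting a single null.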

Why Falsification Rarely Works

Three problems with naive falsification:

  1. Observation error: measurements are imprecise, so an apparent “black swan” sighting may be a mistake, and detection is probabilistic
  2. Continuous hypotheses: most scientific hypotheses are about degree, not binary truth (e.g., “80% of swans are white”), so a single counterexample cannot refute them
  3. Falsification is consensual: the scientific community argues toward consensus about what the evidence shows; it is not a clean logical operation
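The observation-error problem can be made concrete with Bayes' rule. The sketch below is not from the chapter. The function name and all three numbers (how rare black swans are, how often an observer correctly reports one, how often a white swan is misreported as black) are assumptions picked for illustration.

```python
def prob_truly_black(prior_black, hit_rate, false_alarm_rate):
    """P(swan is black | observer reports black), by Bayes' rule."""
    # Total probability of a "black" report, from true and false detections.
    p_report = hit_rate * prior_black + false_alarm_rate * (1 - prior_black)
    return hit_rate * prior_black / p_report


# Assumed values: 1 swan in 1000 is black, 95% of black swans are reported
# as black, and 1% of white swans are mistakenly reported as black.
posterior = prob_truly_black(prior_black=0.001, hit_rate=0.95, false_alarm_rate=0.01)
print(round(posterior, 3))  # roughly 0.087
```

Under these assumed rates, a single reported black swan is still far more likely to be an error than a genuine counterexample, so one observation does not cleanly falsify "all swans are white".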

Three Tools for Golem Engineering

  1. Bayesian data analysis: treats “randomness” as a property of information, not of the world; the uncertainty belongs to the golem's state of knowledge, not to the coin
  2. Multilevel models: parameters all the way down, with parameters themselves described by further parameters; four reasons to use them: (a) adjust estimates for repeat sampling, (b) adjust estimates for imbalance in sampling, (c) study variation, (d) avoid averaging
  3. Model comparison using information criteria (AIC, DIC, WAIC): navigating between overfitting (Scylla) and underfitting (Charybdis)
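The first tool can be sketched with a grid approximation of a coin's posterior. This is a minimal sketch under stated assumptions: a flat prior, a binomial likelihood, and made-up data of 6 heads in 9 tosses. The function name grid_posterior is hypothetical.

```python
def grid_posterior(heads, tosses, grid_size=101):
    """Posterior over the proportion of heads on an evenly spaced grid (flat prior)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Binomial likelihood up to a constant; the binomial coefficient
    # cancels when the values are normalized below.
    unnormalized = [p**heads * (1 - p) ** (tosses - heads) for p in grid]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


# Assumed data: 6 heads in 9 tosses.
grid, post = grid_posterior(heads=6, tosses=9)
best = grid[post.index(max(post))]
print(best)  # the grid value closest to 6/9
```

The coin's proportion is treated as a fixed but unknown quantity. The posterior describes how plausible each candidate value is given the data, which is the sense in which the uncertainty sits with the golem rather than the coin.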

See Also