Permutation Tests and Exact Inference

Summary

A permutation test computes a $p$-value by relabeling the data under an exchangeability assumption and recomputing the statistic across all (or many) relabelings. Under Fisher’s sharp null of no effect, the Fisher randomization test (FRT) is numerically identical to the permutation test, but the FRT is justified by the randomization of the design rather than by exchangeability of outcomes, so it covers a broader class of nulls and designs. Permutation/randomization tests deliver exact finite-sample $p$-values under the sharp null and only asymptotic validity under weak nulls (where studentization is required). The bootstrap is the other resampling route to weak-null inference; relative to it, the FRT’s edge is finite-sample exactness under the sharp null.

Overview

The permutation test (Pitman 1937; Hoeffding 1952; Romano 1990) reasons as follows: if the group labels are exchangeable under the null, the observed statistic should look typical among all label permutations. Concretely, fix the outcomes, enumerate the label permutations $\pi \in \Pi$, recompute the statistic $T(\pi)$ for each, and report the right-tail fraction $p = |\Pi|^{-1} \sum_{\pi \in \Pi} \mathbf{1}\{T(\pi) \ge T_{\text{obs}}\}$. When $|\Pi|$ is huge, draw $\pi_1, \dots, \pi_B$ i.i.d. uniformly from $\Pi$ (Monte Carlo permutation test); the test is still valid up to Monte Carlo error.
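
The enumeration recipe can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited papers: the function names, the difference-in-means statistic, and the toy data are all assumptions made for the example. With fixed group sizes, enumerating label permutations is equivalent to enumerating the subsets of units labeled treated, which is what `itertools.combinations` does here.

```python
from itertools import combinations

def diff_in_means(y, treated):
    """Mean outcome of the units in `treated` minus the mean of the rest."""
    treated = set(treated)
    y1 = [v for i, v in enumerate(y) if i in treated]
    y0 = [v for i, v in enumerate(y) if i not in treated]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

def exact_permutation_pvalue(y, treated, stat=diff_in_means):
    """Right-tail fraction of the statistic over every split of the same size."""
    t_obs = stat(y, treated)
    splits = list(combinations(range(len(y)), len(treated)))
    # Tiny tolerance so exact ties are not lost to floating-point rounding.
    hits = sum(stat(y, s) >= t_obs - 1e-12 for s in splits)
    return hits / len(splits)

# Hypothetical toy data: units 0-3 are labeled treated.
y = [5.1, 6.3, 7.0, 6.8, 3.2, 4.0, 4.4, 3.9]
p_exact = exact_permutation_pvalue(y, treated=[0, 1, 2, 3])
print(p_exact)
```

In this toy data every treated outcome exceeds every control outcome, so the observed split is the only one of the 70 that attains the maximum, and the reported value is 1/70.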

Wu & Ding stress the conceptual difference from the FRT. The classical permutation test assumes the outcomes are exchangeable. The FRT instead takes the random treatment assignment as the source of inference on fixed potential outcomes and assumes no exchangeability (Kempthorne & Doerfler 1969). The two coincide under the sharp null of no effect, because then every imputed potential outcome equals the observed outcome, so re-assigning treatment is the same as permuting labels. Beyond that special case, the FRT is strictly more general.

Main Content

Permutation test ^def-permutation

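Given a statistic $T$ and an exchangeability hypothesis, the permutation $p$-value is $p = |\Pi|^{-1} \sum_{\pi \in \Pi} \mathbf{1}\{T(\pi) \ge T_{\text{obs}}\}$, where $T(\pi)$ recomputes $T$ after applying the label permutation $\pi$ to the (fixed) data and $T_{\text{obs}}$ is the value at the observed labeling. It is approximated by random sampling of permutations when full enumeration is infeasible.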

Exact vs asymptotic validity ^thm-exact-vs-asymptotic

Under the sharp null, the permutation/randomization $p$-value is finite-sample exact: $\Pr(p \le \alpha) \le \alpha$ for all $\alpha \in [0, 1]$, for any statistic and any data-generating process, with no large-sample approximation. Under a weak null the statistic’s randomization distribution generally differs from its true sampling distribution, so validity holds only asymptotically and only for a proper (studentized) statistic (see Studentized Randomization Tests, Theorem 1). Many classical parametric/non-parametric tests (Eden–Yates, Pitman, Kempthorne, Box–Andersen, Bradley, Lehmann) are approximations to the exact permutation/randomization test.
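
Finite-sample exactness under the sharp null can be checked by brute force on a small design. The sketch below is illustrative only: the outcomes, the completely randomized design with three treated units out of six, and the helper names are assumptions. Because the sharp null fixes the outcomes, the $p$-value that each possible assignment would produce can be computed directly, and the share of assignments that reject at level $\alpha$ can then be compared with $\alpha$.

```python
from itertools import combinations

def diff_in_means(y, treated):
    """Mean outcome of the units in `treated` minus the mean of the rest."""
    treated = set(treated)
    y1 = [v for i, v in enumerate(y) if i in treated]
    y0 = [v for i, v in enumerate(y) if i not in treated]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Hypothetical fixed outcomes; under the sharp null they do not depend on the assignment.
y = [2, 3, 1, 5, 3, 4]
assignments = list(combinations(range(len(y)), 3))  # all 20 equally likely designs
stats = [diff_in_means(y, a) for a in assignments]

# p-value that would be reported under each possible realized assignment.
pvals = [sum(t >= t_obs for t in stats) / len(stats) for t_obs in stats]

def rejection_rate(alpha):
    """Probability, over the randomization, that the p-value is at most alpha."""
    return sum(p <= alpha for p in pvals) / len(pvals)

for alpha in (0.05, 0.10, 0.25, 0.50):
    print(alpha, rejection_rate(alpha))
```

The rejection rate never exceeds $\alpha$; ties in the statistic can make it strictly smaller, which is why exactness is stated as an inequality.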

Studentization rescues permutation tests under heteroscedasticity ^def-student-permutation

The need to studentize permutation statistics is classical (Neuhaus 1993; Janssen 1997, 1999; Chung & Romano 2013): in the two-sample Behrens–Fisher problem an unstudentized statistic gives an invalid permutation test under unequal variances, while a studentized one is asymptotically valid because it is asymptotically pivotal. Wu & Ding’s twist: in the finite-population design-based setting the studentized statistic is not asymptotically pivotal. Instead, its sampling distribution is stochastically dominated by the (pivotal) randomization distribution, which gives a conservative rather than exact asymptotic test.
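
To make the contrast concrete, the sketch below runs the same Monte Carlo permutation test with an unstudentized and a studentized statistic. It is an illustration under stated assumptions rather than the procedure of any cited paper: the Welch-type statistic (difference in means over an unpooled standard error), the two-sided use of the absolute value, the helper names, and the data are all choices made for this example.

```python
import random
from math import sqrt

def _mean(x):
    return sum(x) / len(x)

def _var(x):
    m = _mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

def mean_diff(y1, y0):
    """Unstudentized statistic: raw difference in means."""
    return _mean(y1) - _mean(y0)

def welch_t(y1, y0):
    """Studentized statistic: difference in means over an unpooled standard error."""
    return mean_diff(y1, y0) / sqrt(_var(y1) / len(y1) + _var(y0) / len(y0))

def permutation_pvalue(y1, y0, stat, n_draws=2000, seed=0):
    """Monte Carlo permutation p-value for |stat|, counting the observed labeling."""
    rng = random.Random(seed)
    pooled, n1 = list(y1) + list(y0), len(y1)
    t_obs = abs(stat(y1, y0))
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)  # relabel by shuffling the pooled outcomes
        hits += abs(stat(pooled[:n1], pooled[n1:])) >= t_obs - 1e-12
    return (1 + hits) / (1 + n_draws)

# Hypothetical data: small, high-variance group vs. large, low-variance group.
y1 = [-4.1, 3.8, 0.6, -2.7, 5.2, -1.9]
y0 = [0.05 * k - 0.5 for k in range(20)]
print(permutation_pvalue(y1, y0, mean_diff))
print(permutation_pvalue(y1, y0, welch_t))
```

A single data set only shows that the two statistics can give different $p$-values; seeing the type I error distortion of the unstudentized version would require repeating this over many simulated data sets with equal means and unequal variances.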

Permutation/FRT vs the bootstrap ^thm-vs-bootstrap

The bootstrap is the other resampling method for weak nulls (Babu & Singh 1983; Hall 1988 use studentization for second-order accuracy; Imbens & Menzel 2018 fuse the bootstrap with finite-population causal inference). Relative to the bootstrap, the FRT/permutation test’s distinctive advantage is being finite-sample exact under the sharp null. Both can target weak nulls asymptotically; the FRT additionally inherits exactness whenever a compatible sharp null is used for imputation. Studentization aids first-order accuracy (type I control) for the FRT, whereas in the bootstrap it is traditionally used for second-order accuracy.

Examples

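Monte Carlo permutation $p$-value. When the number of possible label splits $\binom{N}{N_1}$ is small, full enumeration is fine, but for larger $N$ it is astronomical. Instead draw $B$ random permutations $\pi_1, \dots, \pi_B$, compute $T(\pi_b)$ for each, and estimate

$$\hat{p} = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\{T(\pi_b) \ge T_{\text{obs}}\}}{1 + B}.$$

The “+1” (counting the observed assignment) keeps the test valid. With $B$ draws, a true $p$-value $p$ is estimated with a standard error of about $\sqrt{p(1-p)/B}$. Wu & Ding likewise use Monte Carlo draws for their applied $p$-values and for the permutations in each simulation run.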
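
A minimal Python sketch of this Monte Carlo estimate follows. The data, the number of draws, the seed, and the function names are hypothetical and chosen only for illustration; they are not the settings used by Wu & Ding. The statistic is assumed to be the difference in means, with the first `n1` entries of `y` treated.

```python
import math
import random

def mc_permutation_pvalue(y, n1, n_draws, seed=0):
    """Monte Carlo p-value for the difference in means; the first n1 entries are treated."""
    def stat(v):
        return sum(v[:n1]) / n1 - sum(v[n1:]) / (len(v) - n1)

    rng = random.Random(seed)
    t_obs = stat(y)
    perm = list(y)
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(perm)  # one uniformly random relabeling
        hits += stat(perm) >= t_obs - 1e-12
    # The "+1" counts the observed labeling, so the estimate is never zero.
    return (1 + hits) / (1 + n_draws)

def mc_standard_error(p, n_draws):
    """Binomial standard error of a Monte Carlo estimate of a true p-value p."""
    return math.sqrt(p * (1 - p) / n_draws)

# Hypothetical data and draw count, chosen only for illustration.
y = [5.1, 6.3, 7.0, 6.8, 5.9, 3.2, 4.0, 4.4, 3.9, 6.1]
p_hat = mc_permutation_pvalue(y, n1=5, n_draws=5000)
print(p_hat, mc_standard_error(p_hat, 5000))
```

Plugging the estimate into `mc_standard_error` gives a rough guide to how many draws are needed: the error shrinks with the square root of the number of draws, so halving it takes four times as many permutations.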

Connections

See Also