Studentized Randomization Tests
Summary
The central result of Wu & Ding (2021): running the FRT with a studentized (Wald-type) statistic $X^2$, the estimated contrast scaled by a heteroscedasticity-robust covariance estimator, yields a test with dual validity: it is finite-sample exact under the sharp null $H_{0F}$ (a free property of any FRT) and asymptotically conservative (valid type I error) under the weak null $H_{0N}$. It is model-free and agnostic to treatment-effect heterogeneity. Non-studentized statistics (the difference in means $\hat\tau$, the $F$ statistic, the Box-type statistic $B$) lack this dual validity and can fail under the weak null. Practical recommendation: always use $X^2$.
Overview
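Recall (from Sharp vs Weak Null Hypotheses) the weak null $H_{0N}: C\bar Y = x$. The estimator $\hat Y = (\hat Y(1), \dots, \hat Y(J))^\top$ of arm means satisfies $E(\hat Y) = \bar Y$ with $\operatorname{cov}(\hat Y) = V$, where $V_{jj} = (N_j^{-1} - N^{-1})\, S^2(j)$ and $V_{jk} = -S(j,k)/N$ for $j \neq k$. The true $V$ depends on the unestimable cross-arm covariances $S(j,k)$, $j \neq k$; but the diagonal “Neyman” estimator

$$\hat V = \operatorname{diag}\{\hat S^2(1)/N_1, \dots, \hat S^2(J)/N_J\}$$

is conservative (it over-estimates the true variance in expectation). Studentizing by $\hat V$ is what makes the FRT robust to variance heterogeneity.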
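The conservativeness claim can be checked entry by entry. The following is a sketch under complete randomization, with the notation assumed here rather than taken from the source: $N_j$ units in arm $j$, finite-population variances $S^2(j)$, covariances $S(j,k)$, and $S$ the $J \times J$ matrix with entries $S(j,k)$ and $S(j,j) = S^2(j)$.

```latex
\begin{aligned}
\operatorname{var}\{\hat Y(j)\} &= \Big(\frac{1}{N_j} - \frac{1}{N}\Big) S^2(j),
\qquad
\operatorname{cov}\{\hat Y(j), \hat Y(k)\} = -\frac{S(j,k)}{N} \quad (j \neq k), \\
E\Big\{\frac{\hat S^2(j)}{N_j}\Big\} &= \frac{S^2(j)}{N_j}
\quad\Longrightarrow\quad
E(\hat V) - \operatorname{cov}(\hat Y) = \frac{S}{N} \succeq 0 .
\end{aligned}
```

Since $S$ is a covariance matrix of the potential outcomes, it is positive semidefinite, so the diagonal estimator is too large on average by exactly the part that cannot be estimated.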
Proposition 4 (the criterion). The FRT with statistic $T$ controls type I error at any level $\alpha$ for $H_{0N}$ if, under the null, the sampling distribution of $T$ is stochastically dominated by its randomization distribution. A statistic with this property is called proper. The whole game is to find a $T$ that is proper; $X^2$ is.
Main Content
The studentized statistic
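$$X^2 = (C\hat Y - x)^\top (C \hat V C^\top)^{-1} (C\hat Y - x),$$

a Wald-type quadratic form using the conservative robust covariance estimator $\hat V = \operatorname{diag}\{\hat S^2(j)/N_j\}$ for $\operatorname{cov}(\hat Y)$. In the treatment-control case ($J = 2$, $C = (1, -1)$, $x = 0$) it reduces to the squared studentized ATE,

$$X^2 = \frac{\{\hat Y(1) - \hat Y(2)\}^2}{\hat S^2(1)/N_1 + \hat S^2(2)/N_2},$$

i.e. the square of Neyman’s ATE estimate over its (heteroscedasticity-robust) standard error.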
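For a single contrast (one row $c$), the quadratic form needs no matrix inverse: it is $(c^\top \hat Y - x)^2$ divided by $\sum_j c_j^2 \hat S^2(j)/N_j$. A minimal standard-library sketch of that special case follows; the function name `studentized_stat` and its arguments are hypothetical, and the data are made up for illustration.

```python
from statistics import mean, variance


def studentized_stat(groups, c, x=0.0):
    """Studentized statistic for one contrast, using arm-specific variances.

    groups: list of lists of observed outcomes, one inner list per arm
    c:      contrast coefficients, one per arm (hypothetical argument name)
    x:      hypothesized value of the contrast under the weak null
    """
    # Estimated contrast minus its hypothesized value
    est = sum(cj * mean(g) for cj, g in zip(c, groups)) - x
    # Neyman-type variance: sum_j c_j^2 * S_hat^2(j) / N_j, with no pooling
    var = sum(cj ** 2 * variance(g) / len(g) for cj, g in zip(c, groups))
    return est ** 2 / var


# Made-up outcomes for two arms
treated = [3.0, 5.0, 4.0, 8.0]
control = [1.0, 2.0, 3.0]
x2 = studentized_stat([treated, control], c=[1.0, -1.0])
print(round(x2, 3))  # 6.0: difference 3, variance 14/3/4 + 1/3 = 1.5
```

With `c=[1.0, -1.0]` this is the squared studentized difference in means; other single contrasts, such as a linear trend across three arms, only change `c`.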
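Theorem 1: $X^2$ is proper (dual validity)
Under Assumption 1, the sampling distribution satisfies, under $H_{0N}$,

$$X^2 \rightsquigarrow \sum_{j=1}^{m} a_j \xi_j^2, \qquad 0 \le a_j \le 1,$$

a weighted sum of independent $\chi^2_1$ variates $\xi_j^2$ with weights at most 1, where $m$ is the rank of $C$. Under the stronger Assumption 2, with $X^2_\pi$ denoting the statistic recomputed under a random reassignment $\pi$, the randomization distribution satisfies

$$X^2_\pi \rightsquigarrow \chi^2_m.$$

Because $\chi^2_m$ stochastically dominates $\sum_j a_j \xi_j^2$ (weights $a_j \le 1$), the FRT with $X^2$ controls type I error asymptotically, and conservatively so, under the weak null. Combined with finite-sample exactness under the sharp null, $X^2$ is valid for both classes of nulls.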
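The dominance step is a pointwise comparison. Here is a short worked version, writing $a_j$ for the weights, $\xi_j$ for independent standard normals, and $m$ for the number of terms (symbols assumed here for illustration):

```latex
0 \le a_j \le 1
\;\Longrightarrow\;
\sum_{j=1}^{m} a_j \xi_j^2 \;\le\; \sum_{j=1}^{m} \xi_j^2 \sim \chi^2_m
\;\Longrightarrow\;
P\Big(\sum_{j=1}^{m} a_j \xi_j^2 \ge q\Big) \;\le\; P\big(\chi^2_m \ge q\big)
\quad \text{for every } q .
```

Taking $q$ to be the $1-\alpha$ quantile of $\chi^2_m$ shows that the limiting rejection probability is at most $\alpha$, with equality only when every weight equals 1.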
Box-type statistic $B$ is NOT proper
The Box-type statistic $B$ has an asymptotic-mean ratio (sampling over randomization) of at most 1, but this is necessary, not sufficient, for the stochastic-dominance criterion of Proposition 4. Hence the FRT with $B$ cannot control type I error in general, even asymptotically. Exceptions: equal variances, or a one-dimensional hypothesis ($C$ a row vector, where $B = X^2$).
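OLS statistic $F$ is NOT proper; Huber–White repairs it
The classical regression $F$ statistic uses a pooled variance estimate $\hat\sigma^2$, which presumes homoscedasticity, an assumption that is incompatible with the potential-outcomes framework. Under $H_{0N}$, its asymptotic sampling distribution is a weighted sum of independent $\chi^2_1$ variates with weights that can exceed 1, so $F$ is improper (it fails type I control under heteroscedasticity). Replacing the pooled covariance estimator by the Huber–White estimator gives a robust Wald statistic, which is asymptotically equivalent to $X^2$ (since the Huber–White variance of each arm mean equals $\hat S^2(j)/N_j$ times $(N_j - 1)/N_j \to 1$). Covariate adjustment / regression-based inference must pair the robust (HW) covariance with the FRT.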
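The asymptotic equivalence can be seen in the simplest regression, one dummy per arm and no intercept, where the design is block-diagonal and the HC0 sandwich for an arm mean collapses to the sum of squared residuals over $N_j^2$. The sketch below assumes that specific setup and the HC0 variant of Huber–White; the function names and data are hypothetical.

```python
from statistics import mean, variance


def hc0_var_of_arm_mean(y_arm):
    """HC0 (Huber-White) variance of an arm mean in a dummy-per-arm regression.

    (X'X)^{-1} X' diag(e^2) X (X'X)^{-1} reduces, for arm j, to
    sum of squared residuals / N_j^2.
    """
    n = len(y_arm)
    ybar = mean(y_arm)
    return sum((yi - ybar) ** 2 for yi in y_arm) / n ** 2


def neyman_var_of_arm_mean(y_arm):
    """Neyman-type variance S_hat^2(j) / N_j, with the usual (N_j - 1) divisor."""
    return variance(y_arm) / len(y_arm)


# Made-up outcomes for one arm with N_j = 5
arm = [1.0, 4.0, 2.0, 7.0, 6.0]
ratio = hc0_var_of_arm_mean(arm) / neyman_var_of_arm_mean(arm)
print(ratio)  # (N_j - 1) / N_j = 0.8, up to floating-point rounding
```

The ratio depends only on the arm size, so the two variance estimates agree in the limit regardless of how unequal the arm variances are.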
Two valid tests, one statistic ^def-two-tests
Theorem 1 yields two asymptotically conservative tests from $X^2$: (a) the FRT, which compares the observed $X^2$ to its randomization distribution (also finite-sample exact for $H_{0F}$); (b) the $\chi^2$ approximation, which rejects if $X^2$ exceeds the $1-\alpha$ quantile of $\chi^2_m$ (no Monte Carlo). The FRT has the extra finite-sample-exactness property; in simulations and applications it tends to be slightly more conservative than the $\chi^2$ approximation.
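Both tests can be sketched for the two-arm case with the standard library alone. The code below is illustrative, not the authors' implementation: it assumes complete randomization, approximates the randomization distribution by Monte Carlo, uses the common `(1 + hits) / (1 + draws)` p-value convention, and relies on the identity $P(\chi^2_1 > q) = \operatorname{erfc}(\sqrt{q/2})$. All function names and the simulated data are hypothetical.

```python
import math
import random
from statistics import mean, variance


def x2_two_arm(y, z):
    """Squared studentized difference in means; z[i] is 1 (treated) or 0 (control)."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    se2 = variance(y1) / len(y1) + variance(y0) / len(y0)  # no pooling
    return (mean(y1) - mean(y0)) ** 2 / se2


def frt_pvalue(y, z, draws=2000, seed=0):
    """Monte Carlo FRT: reshuffle the assignment, keep outcomes fixed, recompute X^2."""
    rng = random.Random(seed)
    observed = x2_two_arm(y, z)
    z_perm = list(z)
    hits = 0
    for _ in range(draws):
        rng.shuffle(z_perm)  # a fresh complete randomization with the same arm sizes
        if x2_two_arm(y, z_perm) >= observed:
            hits += 1
    return (1 + hits) / (1 + draws)


def chi2_1_pvalue(x2):
    """Upper tail of chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))


# Simulated heteroscedastic data: a small noisy arm and a large quiet arm
data_rng = random.Random(1)
y = [data_rng.gauss(1.0, 3.0) for _ in range(20)] + [data_rng.gauss(0.0, 1.0) for _ in range(40)]
z = [1] * 20 + [0] * 40

p_frt = frt_pvalue(y, z)
p_chi2 = chi2_1_pvalue(x2_two_arm(y, z))
print(p_frt, p_chi2)
```

Outcomes with many exact ties can make a permuted arm's sample variance zero, so this sketch expects continuous-valued outcomes.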
Practical recommendation ^summary-recommendation
Use the FRT with the studentized statistic $X^2$ (equivalently $t^2$ for the ATE, or the Huber–White-robust regression statistic). It is model-free, finite-sample exact under the sharp null, asymptotically valid under the weak null, robust to treatment-effect heterogeneity and unequal variances, and extends to stratified, clustered, factorial, ANOVA, trend-test, and binary-outcome designs. Avoid the non-studentized $\hat\tau$, plain $F$, and Box-type $B$ for weak nulls except in the narrow special cases (equal variances; balanced designs; binary outcomes under the equal-means null).
Examples
Charness & Gneezy (2009), financial incentives for exercise (paper’s Sec. 7.1). College students in $J = 3$ arms (no / small / large incentive); outcome = change in weekly gym visits. The arm-level sample variances differ across arms, so the data are clearly heteroscedastic. Testing the weak null $H_{0N}$ at the 1% level, the FRT with $X^2$ and its $\chi^2$ approximation give congruent, significant results, whereas the test based on the non-studentized statistic is overly conservative for this data (its $p$-values inflate). Guided by the theory, one trusts the $X^2$-based $p$-values over the non-studentized ones. (A tiny jitter was added to outcomes because many were exactly 0, to avoid degenerate permuted groups.)
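Sanity computation of $X^2$. Two arms with sizes $N_1, N_2$, sample means $\hat Y(1), \hat Y(2)$, and sample variances $\hat S^2(1), \hat S^2(2)$. Then $X^2 = \{\hat Y(1) - \hat Y(2)\}^2 / \{\hat S^2(1)/N_1 + \hat S^2(2)/N_2\}$ (so $X^2 = t^2$, the square of the studentized difference). Compare to the randomization distribution of $X^2$ (asymptotically $\chi^2_1$, whose 0.95 quantile is about 3.84): a borderline rejection at 5%. Crucially the denominator uses arm-specific variances, not a pooled one.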
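The closing remark about the denominator can be made concrete with a second, purely hypothetical set of summary statistics (these are not the numbers of the computation above). The sketch puts the larger variance in the smaller arm, which is the configuration where pooling understates the standard error.

```python
# Hypothetical summary statistics, chosen only to illustrate pooled vs. unpooled
n1, n2 = 30, 60
mean1, mean2 = 2.1, 1.0
var1, var2 = 9.0, 4.0

diff = mean1 - mean2

# Studentized statistic: arm-specific variances in the denominator
x2 = diff ** 2 / (var1 / n1 + var2 / n2)

# Pooled-variance alternative, shown only for contrast
pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t2_pooled = diff ** 2 / (pooled * (1 / n1 + 1 / n2))

CHI2_1_Q95 = 3.841  # 0.95 quantile of chi-square with 1 degree of freedom

print(round(x2, 3), round(t2_pooled, 3))          # 3.3 4.285
print(x2 > CHI2_1_Q95, t2_pooled > CHI2_1_Q95)    # False True
```

With these inputs the pooled statistic crosses the 5% cutoff while the studentized one does not, which is the kind of disagreement the studentization is designed to resolve.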
Connections
- Fisher Randomization Test and the Sharp Null — supplies the finite-sample-exact half of the dual validity and the imputation machinery.
- Sharp vs Weak Null Hypotheses — explains the heteroscedasticity problem that studentization solves.
- Randomization Inference - Overview: finite-population asymptotics and the conservative estimator $\hat V$.
- Permutation Tests and Exact Inference — studentization is the same device that fixes permutation tests under unequal variances (Behrens–Fisher).
See Also
- Multiple Testing Corrections — when testing several contrasts jointly or in sequence.
- Power Analysis and Sample Size — the conservative test sacrifices some power for valid type I control.
- Differences-in-Differences and Synthetic Control Inference and Diagnostics — other settings where robust/permutation inference matters.
- The Experimental Ideal — covariate adjustment via regression with robust (Huber–White) standard errors.