Studentized Randomization Tests

Summary

The central result of Wu & Ding (2021): running the Fisher randomization test (FRT) with a studentized (Wald-type) statistic $X^2$, the estimated contrast scaled by a heteroscedasticity-robust covariance estimator, yields a test with dual validity. It is finite-sample exact under the sharp null (a free property of any FRT) and asymptotically conservative (valid type I error) under the weak null $H_{0N}$. It is model-free and agnostic to treatment-effect heterogeneity. Non-studentized statistics (the raw difference in means, the classical $F$ statistic, the Box-type statistic $B$) lack this dual validity and can fail under the weak null. Practical recommendation: always use $X^2$.

Overview

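Recall (from Sharp vs Weak Null Hypotheses) the weak null $H_{0N}: C\bar{Y} = x$, a linear constraint on the vector $\bar{Y} = (\bar{Y}(1), \dots, \bar{Y}(J))^\top$ of average potential outcomes. The estimator $\hat{Y}$ of arm means (the vector of arm-wise sample means) satisfies a finite-population central limit theorem, $\hat{Y} - \bar{Y} \approx N(0, V)$ with $V = \operatorname{diag}\{S^2(j)/N_j\} - S/N$, where $S = (S(j,k))$ is the finite-population covariance matrix of the potential outcomes and $S^2(j) = S(j,j)$. The true $V$ depends on the unestimable cross-arm covariances $S(j,k)$, $j \neq k$; but the diagonal “Neyman” estimator

$$\hat{V} = \operatorname{diag}\{\hat{S}^2(1)/N_1, \dots, \hat{S}^2(J)/N_J\},$$

with $\hat{S}^2(j)$ the sample variance in arm $j$, is conservative (it over-estimates the true variance). Studentizing by $\hat{V}$ is what makes the FRT robust to variance heterogeneity.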

Proposition 4 (the criterion). The FRT with statistic $T$ controls type I error at any level $\alpha$ for $H_{0N}$ if, under the null, the sampling distribution of $T$ is stochastically dominated by its randomization distribution. A statistic with this property is called proper. The whole game is to find a $T$ that is proper; $X^2$ is.

Main Content

The studentized statistic

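A Wald-type quadratic form, $X^2 = (C\hat{Y} - x)^\top (C\hat{V}C^\top)^{-1} (C\hat{Y} - x)$, using the conservative robust covariance estimator $\hat{V} = \operatorname{diag}\{\hat{S}^2(j)/N_j\}$ for $\hat{Y}$. In the treatment-control case ($J = 2$, arms labelled 1 and 0, null of zero ATE) it reduces to the squared studentized ATE,

$$t^2 = \frac{\{\hat{Y}(1) - \hat{Y}(0)\}^2}{\hat{S}^2(1)/N_1 + \hat{S}^2(0)/N_0},$$

i.e. the square of Neyman’s ATE estimate over its (heteroscedasticity-robust) standard error.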

Theorem 1: $X^2$ is proper (dual validity)

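Under Assumption 1, the sampling distribution satisfies, under $H_{0N}$,

$$X^2 \;\xrightarrow{d}\; \sum_{j=1}^{m} a_j \xi_j^2, \qquad \xi_j \overset{\text{iid}}{\sim} N(0,1), \quad 0 \le a_j \le 1,$$

a weighted sum of independent $\chi^2_1$ variates with weights at most 1. Under the stronger Assumption 2, with $m = \operatorname{rank}(C)$, the randomization distribution satisfies

$$X^2_\pi \;\xrightarrow{d}\; \chi^2_m,$$

where $X^2_\pi$ denotes $X^2$ recomputed under a random permutation of the treatment labels.

Because $\chi^2_m$ stochastically dominates $\sum_j a_j \xi_j^2$ (weights $a_j \le 1$), the FRT with $X^2$ asymptotically conservatively controls type I error under the weak null. Combined with finite-sample exactness under the sharp null, $X^2$ is robust on two classes of nulls.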
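For intuition about why the weights cannot exceed 1, the two-arm case can be worked out by hand from Neyman's classical variance formula. This is only a sketch: $\hat\tau$ (the difference in arm means) and $S^2(\tau)$ (the finite-population variance of the unit-level effects) are notation assumed here for illustration.

$$
\operatorname{var}(\hat\tau) = \frac{S^2(1)}{N_1} + \frac{S^2(0)}{N_0} - \frac{S^2(\tau)}{N},
\qquad
a = \frac{\operatorname{var}(\hat\tau)}{S^2(1)/N_1 + S^2(0)/N_0}
  = 1 - \frac{S^2(\tau)/N}{S^2(1)/N_1 + S^2(0)/N_0} \le 1 .
$$

The studentizing denominator estimates only the first two terms, so under the weak null the squared studentized ATE behaves like $a\,\chi^2_1$ with $a \le 1$, with equality when the unit-level effects are constant ($S^2(\tau) = 0$).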

Box-type statistic is NOT proper

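The Box-type statistic $B$ has the right ordering of asymptotic means (the mean of its sampling distribution is no larger than that of its randomization distribution), but this is necessary, not sufficient, for the stochastic-dominance criterion of Proposition 4. Hence the FRT with $B$ cannot control type I error in general, even asymptotically. Exceptions: equal variances, or a one-dimensional hypothesis ($C$ a row vector, so that $m = 1$ and the mean ordering alone decides dominance).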

OLS statistic is NOT proper; Huber–White repairs it

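The classical regression statistic $F_{\text{OLS}}$ uses a pooled variance estimate $\hat\sigma^2$, which presumes homoscedasticity and is therefore incompatible with the potential-outcomes framework. Under $H_{0N}$, its sampling distribution is asymptotically a weighted sum of independent $\chi^2_1$ variates with weights that can exceed 1, so $F_{\text{OLS}}$ is improper (it fails type I control under heteroscedasticity). Replacing the pooled covariance by the Huber–White estimator gives $F_{\text{HW}}$, which is asymptotically equivalent to $X^2$ (since the Huber–White covariance estimator is asymptotically equivalent to $\hat{V}$). Covariate adjustment / regression-based inference must pair the robust (HW) covariance with the FRT.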
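To see how far the pooled and arm-specific variance estimates can drift apart, here is a minimal two-arm sketch in plain Python. The function names and the summary statistics are hypothetical, chosen only for illustration; the pooled formula is the usual homoscedastic OLS variance of a difference in means.

```python
def pooled_var(n1, s1sq, n0, s0sq):
    """Homoscedastic (pooled, OLS-style) variance estimate of the difference in means."""
    sigma2 = ((n1 - 1) * s1sq + (n0 - 1) * s0sq) / (n1 + n0 - 2)
    return sigma2 * (1.0 / n1 + 1.0 / n0)


def robust_var(n1, s1sq, n0, s0sq):
    """Arm-specific (Neyman-type, heteroscedasticity-robust) variance estimate."""
    return s1sq / n1 + s0sq / n0


# Hypothetical unbalanced design: the small arm is the noisy one.
ratio_unbalanced = robust_var(10, 9.0, 90, 1.0) / pooled_var(10, 9.0, 90, 1.0)
# Hypothetical balanced design with the same variances.
ratio_balanced = robust_var(50, 9.0, 50, 1.0) / pooled_var(50, 9.0, 50, 1.0)

print(f"robust / pooled, unbalanced: {ratio_unbalanced:.2f}")
print(f"robust / pooled, balanced:   {ratio_balanced:.2f}")
```

In the unbalanced case the pooled estimate is several times smaller than the arm-specific one, so a pooled statistic is inflated by that factor and rejects too often. In the balanced two-arm case the two estimates coincide, which is one of the narrow situations where the pooled statistic is harmless.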

Two valid tests, one statistic ^def-two-tests

Theorem 1 yields two asymptotically conservative tests from $X^2$: (a) the FRT, which compares the observed $X^2$ to its randomization distribution (and is also finite-sample exact for the sharp null); (b) the $\chi^2$ approximation, which rejects if $X^2$ exceeds the $1-\alpha$ quantile of $\chi^2_m$ (no Monte Carlo). The FRT has the extra finite-sample-exactness property; in simulations and applications it tends to be slightly more conservative than the $\chi^2$ approximation.
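The two tests can be sketched side by side for the two-arm case, where the studentized statistic is the squared $t$ with arm-specific variances and the reference distribution is $\chi^2_1$. This is a minimal illustration using only the standard library; the function names, the simulated data, and the Monte Carlo settings are assumptions, not anything taken from the paper.

```python
import math
import random
from statistics import mean, variance


def t2(y, z):
    """Squared studentized difference in means, arm-specific variances."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    se2 = variance(y1) / len(y1) + variance(y0) / len(y0)
    return (mean(y1) - mean(y0)) ** 2 / se2


def frt_pvalue(y, z, draws=2000, seed=0):
    """Monte Carlo FRT: shuffle labels and recompute the statistic each time."""
    rng = random.Random(seed)
    observed = t2(y, z)
    z_perm = list(z)
    hits = 0
    for _ in range(draws):
        rng.shuffle(z_perm)  # keeps arm sizes fixed
        if t2(y, z_perm) >= observed:
            hits += 1
    # The +1 terms count the observed assignment itself.
    return (hits + 1) / (draws + 1)


def chi2_1_pvalue(x):
    """Upper tail of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical heteroscedastic data: small noisy treated arm, large quiet control arm.
data_rng = random.Random(1)
y = [data_rng.gauss(1.0, 3.0) for _ in range(20)] + [data_rng.gauss(0.0, 1.0) for _ in range(60)]
z = [1] * 20 + [0] * 60

print("FRT p-value:        ", frt_pvalue(y, z))
print("chi-square p-value: ", chi2_1_pvalue(t2(y, z)))
```

Only the FRT needs the permutation loop; the $\chi^2$ route is a single tail evaluation. Note that `variance` returns zero when an arm's outcomes are all identical, so heavily tied outcomes can produce a zero denominator in some permutations.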

Practical recommendation ^summary-recommendation

Use the FRT with the studentized statistic $X^2$ (equivalently $t^2$ for the ATE, or the Huber–White-robust $F_{\text{HW}}$). It is model-free, finite-sample exact under the sharp null, asymptotically valid under the weak null, robust to treatment-effect heterogeneity and unequal variances, and extends to stratified, clustered, factorial, ANOVA, trend-test, and binary-outcome designs. Avoid the non-studentized difference in means, plain $F$, and Box-type $B$ for weak nulls except in the narrow special cases (equal variances; balanced designs; binary outcomes under the equal-means null).

Examples

Charness & Gneezy (2009), financial incentives for exercise (paper’s Sec. 7.1). College students were randomized to three arms (no / small / large incentive); the outcome is the change in weekly gym visits. The arm-wise sample variances differ markedly, so the data are clearly heteroscedastic. Testing the weak null at the 1% level, the FRT with $X^2$ and its $\chi^2$ approximation give congruent, significant results, whereas the test based on the non-studentized statistic is overly conservative for this data (its $p$-values inflate). Guided by the theory, one trusts the $X^2$ $p$-values over the non-studentized ones. (A tiny jitter was added to outcomes because many were exactly 0, to avoid degenerate permuted groups.)

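Sanity computation of $t^2$. Take two arms with sizes $N_1, N_0$, sample means $\hat{Y}(1), \hat{Y}(0)$, and sample variances $\hat{S}^2(1), \hat{S}^2(0)$. Then $t = \{\hat{Y}(1) - \hat{Y}(0)\} / \sqrt{\hat{S}^2(1)/N_1 + \hat{S}^2(0)/N_0}$ (so $X^2 = t^2$). Compare to the randomization distribution of $X^2$ (asymptotically $\chi^2_1$, whose 0.95 quantile is 3.84); a value just above this threshold is a borderline rejection at 5%. Crucially, the denominator uses arm-specific variances, not a pooled one.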
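The same computation can be checked numerically. The summary statistics below are hypothetical, picked so the arithmetic is exact and the result lands just above the 5% threshold; the function name is likewise an illustration.

```python
def x2_two_arm(n1, mean1, var1, n0, mean0, var0):
    """Squared studentized ATE from two-arm summary statistics."""
    se2 = var1 / n1 + var0 / n0  # arm-specific variances, no pooling
    return (mean1 - mean0) ** 2 / se2


# Hypothetical inputs: N1 = 16, N0 = 64, means 1.5 and 0.5, variances 3.0 and 4.0.
x2 = x2_two_arm(16, 1.5, 3.0, 64, 0.5, 4.0)
print(x2, x2 > 3.841)  # prints: 4.0 True

# For contrast, the same data with a pooled variance in the denominator.
pooled_sigma2 = (15 * 3.0 + 63 * 4.0) / 78
f_pooled = (1.5 - 0.5) ** 2 / (pooled_sigma2 * (1 / 16 + 1 / 64))
print(f_pooled > 3.841)
```

Here the standard error is $\sqrt{3/16 + 4/64} = 0.5$, so $t = 2$ and $X^2 = 4$, a borderline rejection at 5%. With these particular numbers the pooled version comes out at roughly 3.36 and would not reject, which shows how much the conclusion can hinge on the choice of denominator.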

Connections

See Also