X-Learner

Summary

The X-learner is a two-stage metalearner that imputes individual treatment effects (ITEs) by cross-applying fitted response functions, then regresses the imputed ITEs on covariates. It is particularly powerful when treatment and control groups are unbalanced, as it can exploit the large group to improve estimation in the small group. It achieves a minimax optimal rate that adapts to both CATE smoothness and response function smoothness.

Overview

The X-learner addresses the key failure mode of the T-learner: when treatment groups are unbalanced, one response function is estimated precisely and the other poorly, yet the T-learner fits each arm separately and cannot share information between them.

The X-learner crosses the group information in Stage 2: it uses the well-estimated control model to impute counterfactuals for treated units, and vice versa.

Algorithm

Definition: X-Learner (Three Steps)

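Stage 1: Estimate response functions (same as T-learner), where $W \in \{0, 1\}$ is the treatment indicator:

$$\hat\mu_0(x) = \hat{\mathbb{E}}[Y \mid X = x, W = 0], \qquad \hat\mu_1(x) = \hat{\mathbb{E}}[Y \mid X = x, W = 1]$$

Stage 2: Impute individual treatment effects:

For treated units ($W_i = 1$):

$$\tilde D_i^1 = Y_i - \hat\mu_0(X_i)$$

(observed treated outcome minus imputed control outcome using $\hat\mu_0$)

For control units ($W_i = 0$):

$$\tilde D_i^0 = \hat\mu_1(X_i) - Y_i$$

(imputed treatment outcome using $\hat\mu_1$ minus observed control outcome)

Stage 2 (continued): Regress imputed ITEs on covariates, separately within each group:

$$\hat\tau_1(x) = \hat{\mathbb{E}}[\tilde D^1 \mid X = x] \ \text{(treated units)}, \qquad \hat\tau_0(x) = \hat{\mathbb{E}}[\tilde D^0 \mid X = x] \ \text{(control units)}$$

Stage 3: Combine with propensity score weights:

$$\hat\tau(x) = g(x)\,\hat\tau_0(x) + (1 - g(x))\,\hat\tau_1(x)$$

where $g(x) \in [0, 1]$ is a weighting function, often set to the propensity score $\hat e(x)$.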
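The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single covariate, a hand-written least-squares line as the base learner for every regression, and a constant propensity estimate (the treated fraction) as the default weight. The names `fit_line` and `x_learner`, and the synthetic data-generating process, are hypothetical and chosen only for this example.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns a prediction function."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def x_learner(xs, ws, ys, g=None):
    """Return a CATE estimate tau_hat(x) for 1-D covariates and binary ws."""
    x0 = [x for x, w in zip(xs, ws) if w == 0]
    y0 = [y for y, w in zip(ys, ws) if w == 0]
    x1 = [x for x, w in zip(xs, ws) if w == 1]
    y1 = [y for y, w in zip(ys, ws) if w == 1]

    # Stage 1: one response model per arm (as in the T-learner).
    mu0 = fit_line(x0, y0)
    mu1 = fit_line(x1, y1)

    # Stage 2: impute effects by cross-applying the models ...
    d1 = [y - mu0(x) for x, y in zip(x1, y1)]  # treated: observed - imputed control
    d0 = [mu1(x) - y for x, y in zip(x0, y0)]  # control: imputed treated - observed
    # ... then regress the imputed effects on the covariate within each arm.
    tau1 = fit_line(x1, d1)
    tau0 = fit_line(x0, d0)

    # Stage 3: weighted average. Assumption: without a supplied g, use the
    # treated fraction as a constant propensity estimate.
    if g is None:
        e_hat = len(x1) / len(xs)
        g = lambda x: e_hat
    return lambda x: g(x) * tau0(x) + (1 - g(x)) * tau1(x)


# Synthetic unbalanced data: 200 control units, 20 treated units.
# Hypothetical truth: mu0(x) = 1 + 2x and tau(x) = 0.5 + x.
rng = random.Random(0)
xs = [rng.random() for _ in range(220)]
ws = [1 if i < 20 else 0 for i in range(220)]
ys = [1 + 2 * x + w * (0.5 + x) + rng.gauss(0, 0.1) for x, w in zip(xs, ws)]

tau_hat = x_learner(xs, ws, ys)
print("estimated CATE at x = 0.5:", tau_hat(0.5))  # true value is 1.0
```

Any regression method could replace `fit_line`; the linear fit is used here only because it keeps the sketch dependency-free.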

Intuition

The X-learner uses the large group (say, control with many observations) to improve estimation for the small group (treatment):

  • The control response model $\hat\mu_0$ is estimated precisely from many control observations
  • This precise $\hat\mu_0$ is cross-applied to treated units to impute their counterfactual
  • The imputed ITE $\tilde D_i^1$ is then a cleaner signal for regressing the CATE on $X$

In the second stage, $\hat\tau_1$ is estimated from treated units with imputed ITEs. These have reduced variance because $\hat\mu_0$ is very accurate.
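To see why an accurate control model helps, the imputed effect for a treated unit can be decomposed. The following is a sketch under an assumed additive noise model, $Y_i = \mu_1(X_i) + \varepsilon_i$ with $\mathbb{E}[\varepsilon_i \mid X_i] = 0$ for treated units, writing $\tau(x) = \mu_1(x) - \mu_0(x)$ for the true CATE and $\tilde D_i^1 = Y_i - \hat\mu_0(X_i)$ for the imputed effect; the noise model and the symbol $\varepsilon_i$ are assumptions introduced here for illustration.

$$
\begin{aligned}
\tilde D_i^1 &= Y_i - \hat\mu_0(X_i) \\
&= \mu_1(X_i) + \varepsilon_i - \hat\mu_0(X_i) \\
&= \underbrace{\mu_1(X_i) - \mu_0(X_i)}_{\tau(X_i)} + \varepsilon_i + \underbrace{\mu_0(X_i) - \hat\mu_0(X_i)}_{\text{first-stage error}}
\end{aligned}
$$

Under these assumptions the imputed effect equals the true CATE plus outcome noise plus the estimation error of $\hat\mu_0$. Only the last term depends on the first stage, and it shrinks as the control sample grows.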

Minimax Rate Theorem

Theorem 2: Minimax Optimality of X-Learner

Assume we observe $m$ control and $n$ treated units, with $m \gg n$ (unbalanced design). For families satisfying Conditions 1-6 (Lipschitz continuity, bounded propensity score, bounded moments):

where $N$ is the total sample size and $n$ is the smaller group size.

Key insight: If the CATE is smoother than the response functions, the X-learner can achieve the full-data rate, governed by the total sample size, rather than being bottlenecked by the small group.

Contrast with T-learner: The T-learner is bounded by the smaller group size regardless of CATE smoothness. The X-learner additionally exploits the total sample size when the CATE function is smooth.

Conditions for X-Learner Advantage

The X-learner outperforms T-learner when:

  1. Unbalanced groups: One arm has far more observations than the other
  2. Smooth CATE: The treatment effect is simpler (smoother) than the response functions
  3. Large control group: Can impute good counterfactuals for treated units

When the CATE is constant (or near zero), the X-learner advantage is largest, because a constant function is infinitely smooth.

Propensity Score as Weight

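The weighting function $g(x)$ balances $\hat\tau_0(x)$ and $\hat\tau_1(x)$ in the Stage 3 average $\hat\tau(x) = g(x)\,\hat\tau_0(x) + (1 - g(x))\,\hat\tau_1(x)$. Using $g(x) = \hat e(x)$:

  • When $\hat e(x)$ is small (few treated units), weight is put on $\hat\tau_1$, whose imputed effects rely on $\hat\mu_0$ (estimated from the larger control group)
  • When $\hat e(x)$ is large (many treated), weight is put on $\hat\tau_0$, whose imputed effects rely on $\hat\mu_1$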

This ensures that the better-estimated CATE component dominates.
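A small numeric sketch makes the weighting concrete. It assumes the Stage 3 average takes the form $g(x)\,\hat\tau_0(x) + (1 - g(x))\,\hat\tau_1(x)$ with $g$ set to the propensity score; the helper name `combine` and the component values are hypothetical.

```python
def combine(tau0, tau1, e):
    """Stage 3 average with g(x) = e(x): e * tau0 + (1 - e) * tau1."""
    if not 0.0 <= e <= 1.0:
        raise ValueError("propensity score must lie in [0, 1]")
    return e * tau0 + (1.0 - e) * tau1


# Hypothetical component estimates at a single covariate value x.
tau0_x, tau1_x = 2.0, 1.0

# As the propensity score rises, the result moves from tau1_x toward tau0_x.
for e in (0.1, 0.5, 0.9):
    print(f"e = {e}: combined estimate = {combine(tau0_x, tau1_x, e):.2f}")
```

With a propensity score of 0.1 the combined estimate stays close to the `tau1` component, and with 0.9 it stays close to the `tau0` component.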
