X-Learner

Summary

The X-learner is a two-stage metalearner that imputes individual treatment effects (ITEs) by cross-applying fitted response functions, then regresses the imputed ITEs on covariates. It is particularly powerful when treatment and control groups are unbalanced, as it can exploit the large group to improve estimation in the small group. It achieves a minimax optimal rate that adapts to both CATE smoothness and response function smoothness.

Overview

The X-learner addresses the key failure mode of the T-learner: when treatment groups are unbalanced, one response function is estimated precisely and the other poorly, and the T-learner cannot share information between the two fits.

The X-learner crosses the group information in Stage 2: it uses the well-estimated control model to impute counterfactuals for treated units, and vice versa.

Algorithm

Definition: X-Learner (Three Steps)

Stage 1 — Estimate response functions (same as T-learner):
$\hat{\mu}_{0}(x)$ estimated on control units; $\hat{\mu}_{1}(x)$ estimated on treated units
Stage 2 — Impute individual treatment effects:

For treated units $i \in \{i : W_{i} = 1\}$:
$\tilde{D}_{i}^{1} := Y_{i}^{1} - \hat{\mu}_{0}(X_{i}^{1})$
(observed treated outcome minus imputed control outcome using $\hat{\mu}_{0}$)

For control units $i \in \{i : W_{i} = 0\}$:
$\tilde{D}_{i}^{0} := \hat{\mu}_{1}(X_{i}^{0}) - Y_{i}^{0}$
(imputed treatment outcome using $\hat{\mu}_{1}$ minus observed control outcome)

Stage 2 (continued), regress the imputed ITEs on covariates:
$\hat{\tau}_{1}(x) = E[\tilde{D}^{1} \mid X = x]$ (regress on treated units)
$\hat{\tau}_{0}(x) = E[\tilde{D}^{0} \mid X = x]$ (regress on control units)
Stage 3 — Combine with propensity score weights:
$\hat{\tau}^{X}(x) = g(x)\, \hat{\tau}_{0}(x) + (1 - g(x))\, \hat{\tau}_{1}(x)$
where $g (x) \in [0, 1]$ is a weighting function, often set to the propensity score $g (x) = e (x) = P (W = 1 ∣ X = x)$ .
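
The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a single scalar covariate, uses a hypothetical simple least-squares line (`fit_line`) as the base learner in every stage, and defaults the weight to a constant `g(x) = 0.5` as a placeholder for an estimated propensity score. The names `fit_line` and `x_learner` are made up for this example.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns a prediction function."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def x_learner(xs, ws, ys, g=lambda x: 0.5, fit=fit_line):
    """Return a function x -> estimated CATE, following Stages 1 to 3."""
    x0 = [x for x, w in zip(xs, ws) if w == 0]
    y0 = [y for y, w in zip(ys, ws) if w == 0]
    x1 = [x for x, w in zip(xs, ws) if w == 1]
    y1 = [y for y, w in zip(ys, ws) if w == 1]

    # Stage 1: response functions, fitted separately per arm (as in the T-learner).
    mu0_hat = fit(x0, y0)
    mu1_hat = fit(x1, y1)

    # Stage 2: impute ITEs by cross-applying the fitted response functions ...
    d1 = [y - mu0_hat(x) for x, y in zip(x1, y1)]  # treated units
    d0 = [mu1_hat(x) - y for x, y in zip(x0, y0)]  # control units
    # ... then regress the imputed ITEs on the covariate within each arm.
    tau1_hat = fit(x1, d1)
    tau0_hat = fit(x0, d0)

    # Stage 3: weighted combination with g(x) in [0, 1].
    return lambda x: g(x) * tau0_hat(x) + (1 - g(x)) * tau1_hat(x)


# Noise-free toy data (assumed for illustration): control response 1 + 2x,
# treated response 4 + 3x, so the true CATE is 3 + x.
x0 = [i / 10 for i in range(11)]
x1 = [0.0, 0.5, 1.0]
xs = x0 + x1
ws = [0] * len(x0) + [1] * len(x1)
ys = [1 + 2 * x for x in x0] + [4 + 3 * x for x in x1]
tau_hat = x_learner(xs, ws, ys)
```

On this toy data the response functions and the CATE are exactly linear, so `tau_hat(x)` recovers `3 + x` up to floating-point error. With real data the base learner passed as `fit` would typically be something more flexible, and `g` would be an estimated propensity score.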

Intuition

The X-learner uses the large group (say, control with many observations) to improve estimation for the small group (treatment):

$\hat{\mu}_{0}$ is estimated precisely from many control observations
This precise $\hat{\mu}_{0}$ is cross-applied to treated units to impute their counterfactual
The imputed ITE $\tilde{D}_{i}^{1}$ is then a cleaner signal for regressing the CATE on $X_{i}$

In the second stage, $\hat{\tau}_{1}$ is estimated from treated units with imputed ITEs. These imputed ITEs have reduced variance because $\hat{\mu}_{0}$, fitted on the large control group, is very accurate.
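
A short decomposition makes this concrete. The notation here is assumed for illustration and is not part of the definitions above: $\mu_{0}, \mu_{1}$ are the true response functions, $\tau = \mu_{1} - \mu_{0}$ is the true CATE, and the treated outcome is written as $Y_{i}^{1} = \mu_{1}(X_{i}^{1}) + \varepsilon_{i}$ with mean-zero noise $\varepsilon_{i}$. Substituting into the definition of the imputed ITE:

$\tilde{D}_{i}^{1} = Y_{i}^{1} - \hat{\mu}_{0}(X_{i}^{1}) = \tau(X_{i}^{1}) + \left(\mu_{0}(X_{i}^{1}) - \hat{\mu}_{0}(X_{i}^{1})\right) + \varepsilon_{i}$

The imputed ITE is therefore the true effect, plus the Stage 1 error of $\hat{\mu}_{0}$, plus outcome noise. When the control group is large, the middle term is small, so regressing $\tilde{D}_{i}^{1}$ on $X_{i}^{1}$ targets $\tau$ almost directly.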

Minimax Rate Theorem

Theorem 2: Minimax Optimality of X-Learner

Assume we observe $n_{0}$ control and $n_{1}$ treated units, with $n_{0} \gg n_{1}$ (unbalanced design). For a family of distributions $F \in S(a_{0}, a_{\tau})$ satisfying Conditions 1-6 (Lipschitz continuity, bounded propensity score, bounded moments):
$\sup_{P \in F} \mathrm{EMSE}(P, \hat{\tau}^{X}) \leq C_{\tau} \left(m^{-a_{\tau}} + n^{-a_{0}}\right)$
where $m$ is the total sample size and $n = \min(n_{0}, n_{1})$ is the smaller group size.

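Key insight: The bound is a sum of two terms, so it is governed by the slower-decaying one; it is of order $\max(m^{-a_{\tau}}, n^{-a_{0}})$ rather than the minimum. The CATE term decays with the total sample size $m$ and the CATE smoothness $a_{\tau}$. If $a_{\tau} > a_{0}$ (CATE is smoother than the response functions), this term is small, and the X-learner can achieve $m^{-a_{\tau}}$, the full-data rate, whenever the response-function term $n^{-a_{0}}$ does not dominate, rather than being bottlenecked by the small group.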

Contrast with T-learner: The T-learner's rate is limited by $n^{-a_{0}}$ regardless of how smooth the CATE is. The X-learner additionally exploits the $m^{-a_{\tau}}$ term when the CATE function is smooth.
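
To get a feel for the two terms in the bound, they can be evaluated for concrete numbers. The snippet below is only an arithmetic illustration: the sample sizes and smoothness exponents are hypothetical, the constant $C_{\tau}$ is ignored, and `rate_terms` is a name made up for this example.

```python
def rate_terms(m, n, a_tau, a_0):
    """Return the two terms of the bound m^(-a_tau) + n^(-a_0) and their sum."""
    cate_term = m ** (-a_tau)      # depends on total sample size and CATE smoothness
    response_term = n ** (-a_0)    # depends on the smaller arm and response smoothness
    return cate_term, response_term, cate_term + response_term


# Hypothetical unbalanced design: 10,000 units in total, 100 in the smaller arm,
# with a smooth CATE (a_tau = 1.0) and rougher response functions (a_0 = 0.5).
cate_term, response_term, total = rate_terms(m=10_000, n=100, a_tau=1.0, a_0=0.5)
```

For these inputs the CATE term is about `1e-4` and the response-function term is about `0.1`, which shows how differently the two terms can scale in an unbalanced design and which of them dominates the sum.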

Conditions for X-Learner Advantage

The X-learner outperforms T-learner when:

Unbalanced groups: One arm has far more observations than the other
Smooth CATE: $a_{\tau} > a_{0}$, i.e. the treatment effect is simpler than the response functions
Large control group: Can impute good counterfactuals for treated units

When the CATE is constant (or near-zero), the X-learner advantage is largest because $a_{\tau} \to \infty$ (a constant function is infinitely smooth).
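
A small simulation can illustrate this regime. The sketch below is a toy setup under stated assumptions, not a reproduction of any published experiment: one covariate uniform on [0, 1), a hypothetical control response `sin(6x)`, a constant CATE of 1, Gaussian noise, 2000 control and 40 treated units. The response functions are fitted with a made-up piecewise-constant learner (`fit_bins`), and the X-learner's second stage uses a constant fit (the sample mean), which is chosen here because the simulated CATE is constant.

```python
import math
import random
from statistics import fmean


def fit_bins(xs, ys, k=10):
    """Piecewise-constant regression on [0, 1): mean of y within k equal bins."""
    overall = fmean(ys)
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to the overall mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * k), k - 1)]


def simulate(seed, n0=2000, n1=40, tau=1.0, sd=0.5):
    """Return (MSE of T-learner, MSE of X-learner) on a grid for one draw."""
    rng = random.Random(seed)
    x0 = [rng.random() for _ in range(n0)]
    x1 = [rng.random() for _ in range(n1)]
    y0 = [math.sin(6 * x) + rng.gauss(0, sd) for x in x0]
    y1 = [math.sin(6 * x) + tau + rng.gauss(0, sd) for x in x1]

    # Stage 1 (shared by both learners).
    mu0_hat, mu1_hat = fit_bins(x0, y0), fit_bins(x1, y1)

    # T-learner: difference of the two fitted response functions.
    def t_hat(x):
        return mu1_hat(x) - mu0_hat(x)

    # X-learner: impute ITEs, fit a constant in each arm, combine with
    # g equal to the treated share n1 / (n0 + n1).
    tau1_hat = fmean([y - mu0_hat(x) for x, y in zip(x1, y1)])
    tau0_hat = fmean([mu1_hat(x) - y for x, y in zip(x0, y0)])
    g = n1 / (n0 + n1)
    x_hat = g * tau0_hat + (1 - g) * tau1_hat

    grid = [(i + 0.5) / 100 for i in range(100)]
    mse_t = fmean([(t_hat(x) - tau) ** 2 for x in grid])
    mse_x = fmean([(x_hat - tau) ** 2 for _ in grid])
    return mse_t, mse_x


results = [simulate(seed) for seed in range(20)]
avg_mse_t = fmean([t for t, _ in results])
avg_mse_x = fmean([x for _, x in results])
```

In this setup the T-learner inherits the noise of the response function fitted on only 40 treated units, while the X-learner averages the imputed effects, so `avg_mse_x` comes out smaller than `avg_mse_t`. The gap depends on the assumed learners and data-generating process and should not be read as a general guarantee.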

Propensity Score as Weight

The weighting function $g(x)$ balances $\hat{\tau}_{0}$ and $\hat{\tau}_{1}$. Using $g(x) = e(x) = P(W = 1 \mid X = x)$:

When $e(x)$ is small (few treated units), the weight $1 - e(x)$ on $\hat{\tau}_{1}$ is large; $\hat{\tau}_{1}$ is built on $\hat{\mu}_{0}$, which is fitted on the larger control group
When $e(x)$ is large (many treated units), the weight $e(x)$ on $\hat{\tau}_{0}$ is large; $\hat{\tau}_{0}$ is built on $\hat{\mu}_{1}$, which is fitted on the larger treated group

This ensures that the better-estimated CATE component dominates.
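
The effect of the weight can be checked with a couple of numbers. The helper below simply evaluates the Stage 3 formula with $g(x) = e(x)$; the function name `combine` and the component estimates are hypothetical values chosen for illustration.

```python
def combine(tau0_hat, tau1_hat, e):
    """Stage 3 combination: weight tau0_hat by g = e and tau1_hat by 1 - g."""
    if not 0.0 <= e <= 1.0:
        raise ValueError("propensity score must lie in [0, 1]")
    return e * tau0_hat + (1.0 - e) * tau1_hat


# Hypothetical component estimates at some point x.
tau0_hat, tau1_hat = 2.0, 1.0
few_treated = combine(tau0_hat, tau1_hat, e=0.1)   # 0.1 * 2.0 + 0.9 * 1.0, about 1.1
many_treated = combine(tau0_hat, tau1_hat, e=0.9)  # 0.9 * 2.0 + 0.1 * 1.0, about 1.9
```

With a small propensity score the combined estimate sits close to `tau1_hat`, and with a large one it sits close to `tau0_hat`, which is what the formula $g(x)\, \hat{\tau}_{0}(x) + (1 - g(x))\, \hat{\tau}_{1}(x)$ prescribes.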

Connections

Extends T-Learner and Minimax Rate by adding Stage 2 imputation
Uses propensity score $e (x)$ — see Propensity Score in Bayesian CI for Bayesian treatment
Applied to real experiments in Metalearner Simulation Results
Software: hte R library implements X-, T-, S-learner with confidence intervals
