Metalearners for CATE

Summary

A metalearner (or meta-algorithm) is any algorithm that takes base ML learners as inputs and combines them to estimate the CATE. The framework decouples the structural problem of CATE estimation from the choice of base learner, enabling use of any supervised ML method (random forests, BART, neural nets) as a drop-in component.

Overview

Why metalearners? Estimating the CATE $τ (x) = E [Y (1) - Y (0) ∣ X = x]$ directly is hard because one never observes both potential outcomes for the same unit. Metalearners exploit the structure of the problem by splitting it into subproblems, such as regressing the outcome on covariates within each treatment group, where standard supervised ML excels.

Setup and Notation

Definition: Potential Outcomes Setup

For each unit $i$ with covariates $X_{i} \in \mathbb{R}^{p}$:

$Y_{i} (0)$ = potential outcome under control

$Y_{i} (1)$ = potential outcome under treatment

$W_{i} \in \{0, 1\}$ = treatment indicator

Observed outcome: $Y_{i} = Y_{i} (W_{i})$

CATE: $τ (x) = E [Y_{i} (1) - Y_{i} (0) ∣ X_{i} = x]$

ATE: $τ = E [τ (X_{i})]$
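The notation above can be made concrete with a small simulation. This is a minimal sketch, not part of the original notes: the function name `simulate`, the two covariates, the linear form of $τ (x)$, the noise level and the fair-coin treatment assignment are all assumptions chosen for illustration. Simulation is the only setting in which both potential outcomes are visible for every unit.

```python
import numpy as np

def simulate(n, seed=0):
    """Draw n units with both potential outcomes (only possible in simulation)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n, 2))       # covariates X_i in R^2
    tau = 1.0 + 2.0 * x[:, 0]                     # assumed true CATE tau(x)
    y0 = x[:, 1] + rng.normal(0.0, 0.1, size=n)   # Y_i(0), outcome under control
    y1 = y0 + tau                                 # Y_i(1), outcome under treatment
    w = rng.integers(0, 2, size=n)                # W_i in {0, 1}, assigned by a fair coin
    y = np.where(w == 1, y1, y0)                  # observed outcome Y_i = Y_i(W_i)
    return x, w, y, y0, y1, tau

x, w, y, y0, y1, tau = simulate(1000)
ate = tau.mean()  # sample analogue of the ATE, E[tau(X_i)]
```

In this toy design the individual effect $Y_{i} (1) - Y_{i} (0)$ equals $τ (X_{i})$ exactly, which real data would not guarantee; only `x`, `w` and `y` would be available to an estimator.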

Definition: Metalearner

A metalearner (or meta-algorithm) is an algorithm $\hat{τ}$ that:

Takes one or more supervised learning base learners $μ_{0}, μ_{1}$ (or $μ$ ) as inputs

Uses these base learners to estimate response functions $μ_{0} (x) = E [Y (0) ∣ X = x]$ , $μ_{1} (x) = E [Y (1) ∣ X = x]$

Combines the estimates to produce $\hat{τ} (x)$

The base learner can be any supervised ML method that minimizes expected squared error (regression) or any analogous loss.
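The three steps of the definition can be sketched for the two-model case (the T-Learner row of the comparison table further down). This is an illustrative sketch under stated assumptions: `fit_ols` is a hypothetical stand-in base learner (ordinary least squares with an intercept), and the simulated data and function names are not from the original notes. Any regression method returning a prediction function could be passed in its place.

```python
import numpy as np

def fit_ols(x, y):
    """Hypothetical base learner: least squares with intercept; returns a predictor."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: np.column_stack([np.ones(len(x_new)), x_new]) @ coef

def t_learner(x, w, y, base_learner=fit_ols):
    """Fit one response model per treatment arm and take their difference."""
    mu0_hat = base_learner(x[w == 0], y[w == 0])   # estimate of E[Y(0) | X = x]
    mu1_hat = base_learner(x[w == 1], y[w == 1])   # estimate of E[Y(1) | X = x]
    return lambda x_new: mu1_hat(x_new) - mu0_hat(x_new)

# Toy data with an assumed linear CATE, so the linear base learner is well specified.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(2000, 2))
w = rng.integers(0, 2, size=2000)
tau = 1.0 + 2.0 * x[:, 0]
y = x[:, 1] + w * tau + rng.normal(0.0, 0.1, size=2000)

tau_hat = t_learner(x, w, y)
mse = np.mean((tau_hat(x) - tau) ** 2)  # small here because the model class is correct
```

Passing the base learner as an argument mirrors the decoupling described in the summary: the combination rule stays fixed while the regression method can be swapped.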

Superpopulation Model

Units are drawn i.i.d. from a superpopulation $P$ over $(X, W, Y (0), Y (1))$. The treatment indicator satisfies $W \sim \mathrm{Bern} (e (X))$, where $e (x) = P (W = 1 ∣ X = x)$ is the propensity score.
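A short simulation may help fix what the propensity score means operationally. In this sketch the logistic form of `propensity` and the normal covariates are assumptions made for illustration; the notes do not specify any particular $e (x)$.

```python
import numpy as np

def propensity(x):
    """Assumed propensity score e(x) = P(W = 1 | X = x): logistic in the first covariate."""
    return 1.0 / (1.0 + np.exp(-x[:, 0]))

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=(n, 2))     # i.i.d. covariate draws
e = propensity(x)
w = rng.binomial(1, e)          # W ~ Bern(e(X)), one draw per unit

# Among units whose propensity is close to 0.7, roughly 70% should be treated.
band = np.abs(e - 0.7) < 0.02
share_treated = w[band].mean()
```

The check on `share_treated` is the defining property of $e (x)$: within a narrow slice of covariate space, the treated fraction matches the propensity.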

Families of Distributions and Minimax Rate

Definition: Family with Bounded Minimax Rate

For $a \in (0, 1]$, $S (a)$ is the set of families $F$ with a minimax rate of at most $C N^{- a}$:
$\sup_{P \in F} EMSE (P, \hat{μ}_{N}) \leq C N^{- a}$
for some constant $C$, where $\hat{μ}_{N}$ is the best estimator using $N$ samples.

$F_{0} \in S (1)$: families where we can estimate the response functions at the parametric rate $N^{- 1}$

$F_{2} \in S (2/3)$: nonparametric regression on $\mathbb{R}^{d}$ has minimax rate $N^{- 2/ (2 + d)}$, which equals $N^{- 2/3}$ when $d = 1$
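The arithmetic behind these rates is easy to check numerically. The helper names below are hypothetical, and setting $C = 1$ is an assumption made only to get concrete numbers.

```python
def minimax_exponent(d):
    """Exponent a in N^{-a} for the nonparametric rate N^{-2/(2+d)}."""
    return 2.0 / (2.0 + d)

def error_bound(n, a, c=1.0):
    """Upper bound C * N^{-a} on the EMSE for a family in S(a)."""
    return c * n ** (-a)

def samples_needed(target, a, c=1.0):
    """Smallest N (as a real number) with C * N^{-a} <= target."""
    return (c / target) ** (1.0 / a)

# The exponent shrinks as the covariate dimension d grows.
exponents = {d: minimax_exponent(d) for d in (1, 2, 5, 10)}

# Samples needed to push the bound down to 0.01 at each rate (with C = 1).
n_parametric = samples_needed(0.01, 1.0)             # a = 1
n_nonparametric = samples_needed(0.01, 2.0 / 3.0)    # a = 2/3, i.e. d = 1
```

Under these assumptions the slower rate needs about ten times as many samples for the same bound, and the gap widens quickly with $d$.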

Key implication for CATE: Since CATE is a difference of two conditional means, its estimation rate depends on the smoothness of both response functions and the CATE function itself. The X-learner exploits the case where the CATE is smoother than the response functions.

EMSE for CATE

Definition: EMSE for CATE Estimator

The Expected Mean Squared Error (EMSE) for a CATE estimator $\hat{τ}$ fit on $N$ observations, of which $n$ are treated and $m = N - n$ are controls:
$EMSE (P, \hat{τ}^{mn}) = E [(τ (X) - \hat{τ} (X))^{2} \cdot \sum_{i = 1}^{N} w_{i}]$
where the $w_{i}$ are importance weights ensuring the loss is meaningful when treatment groups are unequal.
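When the true CATE is known, as in a simulation, the squared-error part of this criterion can be approximated by Monte Carlo. The sketch below makes several simplifying assumptions that are not in the notes: the weights are taken to be uniform (so the criterion reduces to a plain mean squared error), the estimators are fixed functions rather than fitted ones, and `true_cate` is a made-up linear effect.

```python
import numpy as np

def true_cate(x):
    """Assumed true CATE for the toy example."""
    return 1.0 + 2.0 * x[:, 0]

def monte_carlo_mse(tau_hat, n_test=100_000, seed=0):
    """Approximate E[(tau(X) - tau_hat(X))^2] on fresh draws of X (uniform weights)."""
    rng = np.random.default_rng(seed)
    x_test = rng.uniform(-1.0, 1.0, size=(n_test, 2))
    return np.mean((true_cate(x_test) - tau_hat(x_test)) ** 2)

# Two hypothetical estimators: one ignores heterogeneity, one is an oracle.
constant_ate = lambda x: np.full(len(x), 1.0)
oracle = true_cate

mse_constant = monte_carlo_mse(constant_ate)  # close to E[(2 X_1)^2] = 4/3 for X_1 ~ U(-1, 1)
mse_oracle = monte_carlo_mse(oracle)          # exactly zero
```

For an estimator fitted on data, the expectation would also average over training samples, which would mean repeating the fit on many simulated training sets.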

Three Metalearners

Learner	Strategy	Key Advantage	Key Weakness
S-Learner	Single model on $(X, W)$	Borrows strength across groups	Treatment indicator may be regularized to zero
T-Learner	Separate models for $W = 0$ and $W = 1$	Clean separation	Suboptimal for unbalanced groups
X-Learner	Two-stage: impute ITEs, then regress	Best for unbalanced treatment	More complex; requires propensity score
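The S-Learner and X-Learner rows can be sketched side by side on an unbalanced sample. This is an illustration under assumptions, not a reference implementation: `fit_ols` is a hypothetical linear base learner, the S-learner is given explicit treatment-by-covariate interaction features (an additive linear model would otherwise force a constant effect estimate), the X-learner follows the usual two-stage recipe of imputing effects in each arm and then regressing them on covariates, and the weighting function `g` is set to the known treatment probability of the simulation.

```python
import numpy as np

def fit_ols(x, y):
    """Hypothetical base learner: least squares with intercept; returns a predictor."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: np.column_stack([np.ones(len(x_new)), x_new]) @ coef

def s_learner(x, w, y, base_learner=fit_ols):
    """Single model on (X, W); the effect is the change in prediction when W flips."""
    def features(x_, w_):
        # Interaction terms are an assumption added so a linear model can vary with x.
        return np.column_stack([x_, w_, x_ * w_[:, None]])
    mu_hat = base_learner(features(x, w), y)
    def tau_hat(x_new):
        ones, zeros = np.ones(len(x_new)), np.zeros(len(x_new))
        return mu_hat(features(x_new, ones)) - mu_hat(features(x_new, zeros))
    return tau_hat

def x_learner(x, w, y, g, base_learner=fit_ols):
    """Two-stage: impute individual effects, regress them, blend with weights g(x)."""
    x0, y0 = x[w == 0], y[w == 0]
    x1, y1 = x[w == 1], y[w == 1]
    mu0_hat = base_learner(x0, y0)          # stage 1: response models per arm
    mu1_hat = base_learner(x1, y1)
    d1 = y1 - mu0_hat(x1)                   # imputed effects for treated units
    d0 = mu1_hat(x0) - y0                   # imputed effects for control units
    tau1_hat = base_learner(x1, d1)         # stage 2: regress imputed effects on X
    tau0_hat = base_learner(x0, d0)
    return lambda x_new: g(x_new) * tau0_hat(x_new) + (1.0 - g(x_new)) * tau1_hat(x_new)

# Unbalanced toy data: only about 10% of units are treated.
rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-1.0, 1.0, size=(n, 2))
w = rng.binomial(1, 0.1, size=n)
tau = 1.0 + 2.0 * x[:, 0]
y = x[:, 1] + w * tau + rng.normal(0.0, 0.1, size=n)

g = lambda x_new: np.full(len(x_new), 0.1)   # known propensity used as the weight
mse_s = np.mean((s_learner(x, w, y)(x) - tau) ** 2)
mse_x = np.mean((x_learner(x, w, y, g)(x) - tau) ** 2)
```

Because every model here is linear and correctly specified, both estimators recover the effect well; the toy example shows the mechanics only and says nothing about which learner is preferable in general.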

Connections

Builds on Causal Estimands — CATE is the target quantity
Potential Outcomes Framework — the theoretical foundation
Propensity Score in Bayesian CI — propensity score $e (x)$ used by X-learner as weighting function
Nonparametric Causal Inference — BART is a common base learner for metalearners
