S-Learner

Summary

The S-learner (single learner) estimates the CATE by fitting a single regression model $\hat{\mu}(x, w)$ on all data, with the treatment indicator $W$ included as a feature. The CATE estimate is the difference between the predictions at $W = 1$ and $W = 0$. It is simple to implement but may underperform when the base learner regularizes the treatment indicator toward zero.

Overview

The S-learner is the most straightforward metalearner. It treats treatment status $W$ as just another covariate and delegates all structure learning to the base learner.

Definition and Algorithm

Definition: S-Learner

Step 1: Fit a single response function $\hat{\mu}(x, w)$ on all observed data, as an estimate of
$\mu(x, w) = E[Y \mid X = x, W = w]$
using any supervised learning method that estimates the conditional mean.

Step 2: Estimate the CATE as the difference in predictions:
$\hat{\tau}^{S}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$
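
The two steps above can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the helper names (`fit_ols`, `make_features`, `s_learner_cate`), the simulated data, and the choice of ordinary least squares with an added $x \cdot w$ interaction as the base learner are all assumptions made for illustration. Any regressor that estimates a conditional mean could be passed in instead.

```python
import numpy as np

def fit_ols(features, y):
    """Hypothetical base learner: least squares with an intercept.
    Returns a function that predicts from new feature rows."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    def predict(new_features):
        new_design = np.column_stack([np.ones(len(new_features)), new_features])
        return new_design @ coef

    return predict

def make_features(X, w):
    """Stack covariates, the treatment indicator, and their interaction.
    The interaction column is an assumption of this sketch: a linear base
    learner needs it to represent a CATE that varies with x."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    return np.column_stack([X, w, X * w])

def s_learner_cate(X, w, y, X_new, fit=fit_ols):
    # Step 1: fit one model on all data, with w included as a feature.
    mu_hat = fit(make_features(X, w), y)
    # Step 2: difference of predictions with w set to 1 and to 0.
    n_new = len(X_new)
    mu1 = mu_hat(make_features(X_new, np.ones(n_new)))
    mu0 = mu_hat(make_features(X_new, np.zeros(n_new)))
    return mu1 - mu0

# Simulated randomized experiment (illustrative data, not from the source).
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-1, 1, size=(n, 1))
w = rng.integers(0, 2, size=n)
tau = 1.0 + 2.0 * X[:, 0]                      # true CATE in this simulation
y = 0.5 * X[:, 0] + w * tau + rng.normal(scale=0.1, size=n)

X_new = np.array([[-0.5], [0.0], [0.5]])
tau_hat = s_learner_cate(X, w, y, X_new)
print(tau_hat)
```

In this simulation the true CATE is $1 + 2x$, so the estimates at the three query points should land close to 0, 1, and 2.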

Properties and Performance

Key advantage:

Borrows strength across treatment and control groups — all data used in one model
With a linear base learner and no interaction terms, $\hat{\mu}(x, w) = β_{0} + β_{W} w + β_{X}^{\top} x$ gives $\hat{\tau}^{S}(x) = β_{W}$ (a constant, i.e. an ATE estimate)
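
To spell out the linear case above: subtracting the two predictions cancels every term that does not involve $w$, which is why the estimate cannot vary with $x$.

$\hat{\tau}^{S}(x) = (β_{0} + β_{W} \cdot 1 + β_{X}^{\top} x) - (β_{0} + β_{W} \cdot 0 + β_{X}^{\top} x) = β_{W}$

As an illustration that goes beyond the model written above, suppose the linear model also included an interaction term $w \, β_{WX}^{\top} x$ (the coefficient name $β_{WX}$ is introduced here only for this example). The same subtraction would then leave a CATE estimate that is linear in $x$:

$\hat{\tau}^{S}(x) = β_{W} + β_{WX}^{\top} x$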

Key weakness:

Many ML algorithms (e.g., random forests) regularize all features equally. If the treatment effect is small relative to other variation in the outcome, the contribution of the treatment indicator $W$ may be shrunk toward zero → $\hat{\tau}^{S}(x) \approx 0$ even when the true effects are nonzero and heterogeneous
RF as base learner: the treatment indicator $W$ competes with the $p$ covariates as a split candidate. If candidates were chosen uniformly, $W$ would be selected $\approx 1/(p + 1)$ of the time, so some trees may rarely or never split on $W$ → treatment effect biased toward zero
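
The shrinkage mechanism described above can be illustrated with a small simulation. Ridge regression is used here as a stand-in because it has a closed form and penalizes every coefficient equally. It is not one of the learners named in this note, and the function name `ridge_coefficients`, the data, and the penalty values are assumptions chosen for the sketch. With no interaction terms, the S-learner CATE estimate is simply the fitted coefficient on $w$.

```python
import numpy as np

def ridge_coefficients(features, y, penalty):
    """Closed-form ridge fit on centered data (so no intercept is needed).
    Every coefficient, including the one on w, gets the same penalty."""
    Z = features - features.mean(axis=0)
    y_centered = y - y.mean()
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + penalty * np.eye(p), Z.T @ y_centered)

# Illustrative data: strong covariate effects, a comparatively small
# constant treatment effect, and randomized treatment assignment.
rng = np.random.default_rng(1)
n, p = 2000, 5
X = rng.normal(size=(n, p))
w = rng.integers(0, 2, size=n).astype(float)
true_effect = 0.5
y = X @ np.full(p, 2.0) + true_effect * w + rng.normal(size=n)

features = np.column_stack([X, w])   # w is the last column

# S-learner estimate = coefficient on w, for increasing penalty strength.
tau_by_penalty = {
    penalty: ridge_coefficients(features, y, penalty)[-1]
    for penalty in (0.0, 500.0, 50000.0)
}
for penalty, tau_hat in tau_by_penalty.items():
    print(f"penalty={penalty:>8.1f}  tau_hat={tau_hat:.4f}")
```

As the penalty grows, the coefficient on $w$ is pulled toward zero along with all the others, so the estimated effect shrinks even though the true effect in the simulation is unchanged.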

When S-learner performs well:

Treatment effect is constant (or close to constant) and truly small
The CATE function is simpler than either response function
Large datasets where regularization pressure is low

When it fails:

Highly heterogeneous CATE
Base learner regularizes features to zero (random forests, LASSO)
Propensity score far from 0.5

Illustration (Fig. 1A, 1B)

In a simple example with one covariate $x \in [-1, 1]$ and piecewise linear $\tau(x)$:

$\hat{\mu}_{0}(x)$ (blue) fits the control group and matches the data well
$\hat{\mu}_{1}(x)$ (dashed) fits the treated group, but without borrowing strength from the control data the fit is relatively poor
The S-learner combines both, producing a smoother but potentially biased $\hat{\tau}^{S}$

Connections

Part of the Metalearners for CATE framework
Contrast with T-Learner and Minimax Rate (separate models) and X-Learner (two-stage imputation)
Nonparametric Causal Inference — BART S-learner is common in Bayesian causal inference
