Q-learning
Summary
Q-learning (“Q” for quality; Watkins 1989) estimates the optimal dynamic treatment regime by directly modeling the Q-functions and fitting them by backward-recursive regression. At each decision $k$ (from $K$ down to $1$) one posits a parametric model $Q_k(\bar{s}_k, \bar{a}_k; \xi_k)$, regresses the current “value-to-go” response on it by OLS/WLS, and reads off the optimal rule $\hat{d}_k^{opt}(\bar{s}_k, \bar{a}_{k-1}) = \arg\max_{a_k} Q_k(\bar{s}_k, \bar{a}_{k-1}, a_k; \hat{\xi}_k)$. The method is simple and uses familiar regression machinery, but it is consistent only if every Q-function is correctly specified. For $k < K$ the response is an estimated, typically highly nonlinear value function, so linear working models are easily misspecified and errors propagate backward.
Overview
Q-learning is the more transparent of the two methods in Q- and A-learning - Overview: it turns the dynamic-programming recursion into a sequence of regressions. This note gives the backward-fitting algorithm, the linear-model illustration, and the misspecification concern that motivates A-learning and Robustness.
Main Content
Backward-recursive fitting
Q-learning estimating equations (§5.1, Eqs. 26-27)
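Posit models $Q_k(\bar{s}_k, \bar{a}_k; \xi_k)$ for $k = K, K-1, \ldots, 1$. Fit backward:
- Decision $K$ (response $Y$): solve for $\hat{\xi}_K$
$$\sum_{i=1}^{n} \frac{\partial Q_K(\bar{S}_{Ki}, \bar{A}_{Ki}; \xi_K)}{\partial \xi_K}\, \Sigma_K^{-1}(\bar{S}_{Ki}, \bar{A}_{Ki}) \left\{ Y_i - Q_K(\bar{S}_{Ki}, \bar{A}_{Ki}; \xi_K) \right\} = 0,$$
where $\Sigma_K(\bar{s}_K, \bar{a}_K)$ is a working variance model (constant $\Sigma_K$ gives OLS).
- Form the fitted value-to-go for the previous stage: $\tilde{V}_{Ki} = \max_{a_K} Q_K(\bar{S}_{Ki}, \bar{A}_{K-1,i}, a_K; \hat{\xi}_K)$.
- Decision $k$ ($k = K-1, \ldots, 1$, response $\tilde{V}_{(k+1)i}$): solve the analogous equation for $\hat{\xi}_k$, then form $\tilde{V}_{ki} = \max_{a_k} Q_k(\bar{S}_{ki}, \bar{A}_{k-1,i}, a_k; \hat{\xi}_k)$.
The estimated optimal regime is $\hat{d}^{opt} = (\hat{d}_1^{opt}, \ldots, \hat{d}_K^{opt})$, with $\hat{d}_k^{opt}(\bar{s}_k, \bar{a}_{k-1}) = \arg\max_{a_k} Q_k(\bar{s}_k, \bar{a}_{k-1}, a_k; \hat{\xi}_k)$.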
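The backward recursion above can be sketched in a few lines. This is a minimal illustration, not the source's implementation: it assumes binary treatments coded 0/1, linear working models of the form $H^T\beta + a\,(H^T\psi)$, and a constant working variance (plain OLS). The helper names (`q_features`, `fit_stage`, `q_learning`) and the simulated two-decision data are hypothetical.

```python
import numpy as np

def q_features(H, a):
    """Design matrix for Q(h, a) = H'beta + a * (H'psi), with a in {0, 1}."""
    return np.hstack([H, a[:, None] * H])

def fit_stage(H, a, response):
    """OLS fit of one stage; returns beta, psi and the fitted value-to-go."""
    p = H.shape[1]
    xi, *_ = np.linalg.lstsq(q_features(H, a), response, rcond=None)
    beta, psi = xi[:p], xi[p:]
    # max over a in {0, 1} of H'beta + a * H'psi
    v_tilde = H @ beta + np.maximum(H @ psi, 0.0)
    return beta, psi, v_tilde

def q_learning(histories, treatments, y):
    """Stage K regresses on Y; each earlier stage regresses on the
    fitted value-to-go from the stage after it."""
    response, fits = y, []
    for H, a in zip(reversed(histories), reversed(treatments)):
        beta, psi, response = fit_stage(H, a, response)
        fits.append((beta, psi))
    return fits[::-1]  # reorder as stage 1, ..., stage K

# Hypothetical simulated two-decision randomized trial (illustration only).
rng = np.random.default_rng(0)
n = 5000
s1 = rng.normal(size=n)
a1 = rng.integers(0, 2, size=n).astype(float)
s2 = 0.5 * s1 + rng.normal(size=n)
a2 = rng.integers(0, 2, size=n).astype(float)
y = s1 + s2 + a2 * (1.0 - 2.0 * s2) + rng.normal(size=n)

H1 = np.column_stack([np.ones(n), s1])
H2 = np.column_stack([np.ones(n), s1, a1, s2])
fits = q_learning([H1, H2], [a1, a2], y)

psi2 = fits[1][1]
d2_hat = (H2 @ psi2 > 0).astype(int)  # estimated stage-2 rule
```

In this simulation the stage-2 model is correctly specified, so `psi2` should land near `(1, 0, 0, -2)`. The stage-1 linear model is only an approximation to the conditional mean of the fitted value-to-go.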
Linear two-decision illustration ($K = 2$, binary treatment) (§5.1, Eq. 29)
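With $\mathcal{A}_k = \{0, 1\}$ and history vectors $H_1 = (1, s_1^T)^T$, $H_2 = (1, s_1^T, a_1, s_2^T)^T$, posit linear models
$$Q_1(s_1, a_1; \xi_1) = H_1^T \beta_1 + a_1 (H_1^T \psi_1), \qquad Q_2(\bar{s}_2, \bar{a}_2; \xi_2) = H_2^T \beta_2 + a_2 (H_2^T \psi_2),$$
with $\xi_k = (\beta_k^T, \psi_k^T)^T$. Then the optimal rules are threshold rules on the treatment-interaction term:
$$d_1^{opt}(s_1) = I(H_1^T \psi_1 > 0), \qquad d_2^{opt}(\bar{s}_2, a_1) = I(H_2^T \psi_2 > 0).$$
The catch: $Q_2$ is a standard regression of $Y$ on observed data, but $Q_1$ models $E\{V_2(\bar{S}_2, A_1) \mid S_1 = s_1, A_1 = a_1\}$, where $V_2(\bar{s}_2, a_1) = \max_{a_2} Q_2(\bar{s}_2, a_1, a_2) = H_2^T \beta_2 + (H_2^T \psi_2)\, I(H_2^T \psi_2 > 0)$. This is a max of a regression, generally highly nonlinear, and only approximated by the linear $Q_1(s_1, a_1; \xi_1)$.
Explicitly, if $S_2$ given $(S_1, A_1)$ is normal, the true $Q_1$ involves the normal cdf and is clearly nonlinear (§5.3, Eq. 33), so the posited linear $Q_1(s_1, a_1; \xi_1)$ is misspecified, and for larger $K$ this incompatibility propagates from $k = K-1$ down to $k = 1$.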
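A small numerical check makes the source of the nonlinearity concrete. It rests on a standard identity, stated here as background rather than as the source's Eq. 33: if $X \sim N(\mu, \sigma^2)$, then $E\{\max(X, 0)\} = \mu\,\Phi(\mu/\sigma) + \sigma\,\phi(\mu/\sigma)$. Taking the maximum over a binary treatment adds a term of the form $\max(\text{contrast}, 0)$ to the value function, so when that contrast is conditionally normal its expectation picks up the normal cdf and pdf. The function name and the numbers below are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean

def expected_positive_part(mu, sigma):
    """E[max(X, 0)] for X ~ Normal(mu, sigma^2)."""
    z = NormalDist()
    return mu * z.cdf(mu / sigma) + sigma * z.pdf(mu / sigma)

# Monte Carlo check of the closed form at one (mu, sigma).
random.seed(0)
mu, sigma = 0.3, 1.0
mc = fmean(max(random.gauss(mu, sigma), 0.0) for _ in range(200_000))
exact = expected_positive_part(mu, sigma)

# Nonlinearity in mu: the value at the midpoint is not the average
# of the values at the endpoints, as it would be for a linear function.
lo, mid, hi = (expected_positive_part(m, 1.0) for m in (-1.0, 0.0, 1.0))
gap = (lo + hi) / 2 - mid
```

A positive `gap` shows that the expectation is convex in $\mu$, so a working model that is linear in the stage-1 history cannot reproduce it exactly.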
Efficiency, robustness, and remedies
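- Efficiency. At decision $K$ (response $Y$), taking $\Sigma_K(\bar{s}_K, \bar{a}_K) = \operatorname{var}(Y \mid \bar{S}_K = \bar{s}_K, \bar{A}_K = \bar{a}_K)$ gives the asymptotically efficient estimator; in practice OLS/WLS is standard. Some authors define Q-learning as OLS fitting (Chakraborty, Murphy & Strecher 2010).
- Consistency requires all models correct. Even under sequential randomization, the estimated regime $\hat{d}^{opt}$ is generally inconsistent for the true optimal regime if any $Q_k(\bar{s}_k, \bar{a}_k; \xi_k)$ is misspecified.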
- Flexible models. Misspecification risk can be reduced with flexible/ML regressions (e.g. SVR; Zhao, Kosorok & Zeng 2009) tuned by cross-validated MSE — at the cost of interpretability (“black box” rules). A compromise is to fit a simple, interpretable model (e.g. a decision tree) to the complex model’s fitted values.
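The compromise in the last bullet can be sketched as follows. This is a toy illustration under stated assumptions: `black_box_contrast` is a hypothetical stand-in for a flexible model's fitted treatment contrast, there is a single covariate, and the "decision tree" is a hand-written depth-one tree (a stump) rather than a library implementation.

```python
import numpy as np

def fit_stump(x, target):
    """Depth-one regression tree on one covariate: choose the split
    that minimizes the total within-leaf squared error."""
    order = np.argsort(x)
    xs, ts = x[order], target[order]
    best = (np.inf, None, None, None)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between tied covariate values
        left, right = ts[:i], ts[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    _, cut, left_value, right_value = best
    return cut, left_value, right_value

def black_box_contrast(x):
    # Hypothetical fitted contrast from a flexible model (assumption).
    return np.tanh(3.0 * (0.5 - x))

x = np.linspace(-1.0, 2.0, 301)
cut, left_value, right_value = fit_stump(x, black_box_contrast(x))

# Interpretable rule read off the stump: treat when the leaf mean is positive.
treat_if_below_cut = left_value > 0 > right_value
```

The stump replaces the smooth fitted contrast with a single threshold rule, which is easier to communicate but discards how strongly the contrast favors each treatment away from the cut point.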
Connections
- Implements the backward-induction recursion by stagewise regression; the dynamic-programming analog in reinforcement learning is also called Q-learning (Watkins & Dayan 1992).
- Contrast with A-learning and Robustness, which models only the treatment contrast and the propensity, gaining double robustness.
- The stagewise-regression idea generalizes single-stage outcome modeling such as the T-learner.
See Also
- A-learning and Robustness — the contrast-based, doubly-robust alternative
- Optimal Regime via Dynamic Programming — the Q-functions being modeled
- Q- and A-learning - Overview — comparison and findings