Q-learning

Summary

Q-learning (“Q” for quality; Watkins 1989) estimates the optimal dynamic treatment regime by directly modeling the Q-functions $Q_k(h_k, a_k)$ and fitting them by backward-recursive regression. At each decision $k$ (from $K$ down to $1$) one posits a parametric model $Q_k(h_k, a_k; \xi_k)$, regresses the current “value-to-go” response on it by OLS/WLS, and reads off the optimal rule $\hat d_k(h_k) = \arg\max_{a_k} Q_k(h_k, a_k; \hat\xi_k)$. The method is simple and uses familiar regression machinery, but it is consistent only if every Q-function is correctly specified. For $k < K$ the response is an estimated, typically highly nonlinear value function, so linear working models are easily misspecified and errors propagate backward.

Overview

Q-learning is the more transparent of the two methods in Q- and A-learning - Overview: it turns the dynamic-programming recursion into a sequence of regressions. This note gives the backward-fitting algorithm, the linear-model illustration, and the misspecification concern that motivates A-learning and Robustness.

Main Content

Backward-recursive fitting

Q-learning estimating equations (§5.1, Eqs. 26-27)

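Posit models $Q_k(h_k, a_k; \xi_k)$ for $k = 1, \dots, K$. Fit backward:

  • Decision $K$ (response $Y$): solve for $\hat\xi_K$

$$\sum_{i=1}^n \frac{\partial Q_K(H_{Ki}, A_{Ki}; \xi_K)}{\partial \xi_K}\, \Sigma_K^{-1}(H_{Ki}, A_{Ki})\, \{Y_i - Q_K(H_{Ki}, A_{Ki}; \xi_K)\} = 0,$$

where $\Sigma_K(h_K, a_K)$ is a working variance model (constant $\Sigma_K$ gives OLS).

  • Form the fitted value-to-go for the previous stage: $\tilde V_{Ki} = \max_{a_K} Q_K(H_{Ki}, a_K; \hat\xi_K)$.
  • Decision $k$ ($k = K-1, \dots, 1$, response $\tilde V_{(k+1)i}$): solve the analogous equation for $\hat\xi_k$, then form $\tilde V_{ki} = \max_{a_k} Q_k(H_{ki}, a_k; \hat\xi_k)$.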

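The estimated optimal regime is $\hat d_Q^{opt} = (\hat d_{Q,1}^{opt}, \dots, \hat d_{Q,K}^{opt})$, where

$$\hat d_{Q,k}^{opt}(h_k) = \arg\max_{a_k} Q_k(h_k, a_k; \hat\xi_k), \quad k = 1, \dots, K.$$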
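The backward fitting steps above can be sketched for two decisions with binary treatments coded 0/1 and linear working models fitted by OLS. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`ols`, `design`, `q_value`, `q_learning_two_stage`, `rule`) and the simulated data-generating model are hypothetical, and the small normal-equations solver only keeps the example dependency-free (a numerical library would normally be used).

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def design(h, a):
    """Regressors for Q(h, a) = h~' beta + a * h~' psi, with h~ = (1, h)."""
    ht = [1.0] + list(h)
    return ht + [a * v for v in ht]

def q_value(h, a, xi):
    """Fitted Q(h, a; xi) for xi = (beta, psi)."""
    return sum(c * v for c, v in zip(xi, design(h, a)))

def q_learning_two_stage(H1, A1, H2, A2, Y):
    # Decision 2: ordinary regression of the observed outcome Y.
    xi2 = ols([design(h, a) for h, a in zip(H2, A2)], Y)
    # Fitted value-to-go: maximize the fitted Q2 over a2 in {0, 1}.
    V2 = [max(q_value(h, 0, xi2), q_value(h, 1, xi2)) for h in H2]
    # Decision 1: same regression, with V2 as the pseudo-response.
    xi1 = ols([design(h, a) for h, a in zip(H1, A1)], V2)
    return xi1, xi2

def rule(h, xi):
    """Estimated rule: treat (1) when the fitted Q is larger under a = 1."""
    return int(q_value(h, 1, xi) > q_value(h, 0, xi))

# Simulated trial (assumed generative model, both treatments randomized).
random.seed(0)
H1, A1, H2, A2, Y = [], [], [], [], []
for _ in range(2000):
    s1 = random.gauss(0, 1)
    a1 = random.randint(0, 1)
    s2 = s1 + 2 * a1 + random.gauss(0, 1)
    a2 = random.randint(0, 1)
    y = 1 + s2 + a2 * (0.5 + s2) + random.gauss(0, 0.5)
    H1.append((s1,)); A1.append(a1)
    H2.append((s1, a1, s2)); A2.append(a2)
    Y.append(y)

xi1, xi2 = q_learning_two_stage(H1, A1, H2, A2, Y)
print("stage-2 rule at s2 = 1:", rule((0.0, 0, 1.0), xi2))
print("stage-1 rule at s1 = 0:", rule((0.0,), xi1))
```

In this simulation the second-stage model is correctly specified, while the first-stage linear model only approximates the expectation of a maximum, which is the misspecification concern discussed in this note.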

Linear two-decision illustration ($K = 2$, binary treatment) (§5.1, Eq. 29)

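With treatments coded $a_k \in \{0, 1\}$ and history vectors $\tilde h_1 = (1, h_1^T)^T$, $\tilde h_2 = (1, h_2^T)^T$, posit linear models

$$Q_1(h_1, a_1; \xi_1) = \tilde h_1^T \beta_1 + a_1\, \tilde h_1^T \psi_1, \qquad Q_2(h_2, a_2; \xi_2) = \tilde h_2^T \beta_2 + a_2\, \tilde h_2^T \psi_2,$$

with $\xi_k = (\beta_k^T, \psi_k^T)^T$. Then the optimal rules are threshold rules on the treatment-interaction term:

$$\hat d_{Q,k}^{opt}(h_k) = I(\tilde h_k^T \hat\psi_k > 0), \quad k = 1, 2.$$

The catch: $Q_2$ is a standard regression of $Y$ on observed data, but $Q_1$ models $E\{V_2(H_2) \mid H_1 = h_1, A_1 = a_1\}$, where $V_2(h_2) = \max_{a_2} Q_2(h_2, a_2) = \tilde h_2^T \beta_2 + (\tilde h_2^T \psi_2)\, I(\tilde h_2^T \psi_2 > 0)$: a max of a regression, generally highly nonlinear, only approximated by the linear $Q_1(h_1, a_1; \xi_1)$.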

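Explicitly, if the intermediate covariates observed between the two decisions are conditionally normal given $(h_1, a_1)$, the true $Q_1(h_1, a_1)$ involves the normal cdf and is clearly nonlinear (§5.3, Eq. 33), so the posited linear $Q_1(h_1, a_1; \xi_1)$ is misspecified, and for larger $K$ this incompatibility propagates from decision $K-1$ down to decision $1$.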
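To see where the normal cdf comes from, suppose (as an illustrative assumption, not a term-by-term reproduction of Eq. 33) that the second-stage treatment contrast is conditionally normal with mean $\mu$ and standard deviation $\sigma$ given the first-stage history and treatment. The expected gain from acting optimally at stage 2 is then $E\{\max(0, Z)\} = \mu\,\Phi(\mu/\sigma) + \sigma\,\phi(\mu/\sigma)$, a standard identity for the normal distribution. The sketch below checks the identity by Monte Carlo and shows that it is convex rather than linear in $\mu$; the function name `expected_positive_part` is hypothetical.

```python
import random
from statistics import NormalDist

def expected_positive_part(mu, sigma):
    """E[max(0, Z)] for Z ~ N(mu, sigma^2): mu * Phi(mu/sigma) + sigma * phi(mu/sigma)."""
    std = NormalDist()  # standard normal
    z = mu / sigma
    return mu * std.cdf(z) + sigma * std.pdf(z)

# Monte Carlo check of the closed form at one (mu, sigma).
random.seed(1)
mu, sigma = 0.5, 1.0
draws = [max(0.0, random.gauss(mu, sigma)) for _ in range(200_000)]
mc = sum(draws) / len(draws)
closed = expected_positive_part(mu, sigma)

# Second difference in mu: zero for a linear function, positive here.
curvature = (expected_positive_part(1.0, 1.0)
             - 2 * expected_positive_part(0.0, 1.0)
             + expected_positive_part(-1.0, 1.0))

print("Monte Carlo:", round(mc, 3), "closed form:", round(closed, 3))
print("second difference in mu:", round(curvature, 3))
```

If $\mu$ is itself linear in $(h_1, a_1)$, the true first-stage Q-function contains this convex term, so no model that is linear in $(h_1, a_1)$ can match it exactly.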

Efficiency, robustness, and remedies

  • Efficiency. At decision $K$ (response $Y$), taking the working variance $\Sigma_K(h_K, a_K) = \mathrm{var}(Y \mid H_K = h_K, A_K = a_K)$ gives the asymptotically efficient estimator of $\xi_K$; in practice OLS/WLS is standard. Some authors define Q-learning as OLS fitting (Chakraborty, Murphy & Strecher 2010).
  • Consistency requires all models correct. Even under sequential randomization, the estimated regime $\hat d_Q^{opt}$ is generally inconsistent for the true optimal regime if any $Q_k(h_k, a_k; \xi_k)$ is misspecified.
  • Flexible models. Misspecification risk can be reduced with flexible/ML regressions (e.g. SVR; Zhao, Kosorok & Zeng 2009) tuned by cross-validated MSE — at the cost of interpretability (“black box” rules). A compromise is to fit a simple, interpretable model (e.g. a decision tree) to the complex model’s fitted values.
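The interpretability compromise in the last bullet can be sketched as follows, assuming a single covariate and a depth-1 tree (a stump) fitted to the treatment recommended by a flexible model. Everything here is hypothetical: `black_box_contrast` is a made-up stand-in for a fitted difference $Q(h, 1) - Q(h, 0)$ from some flexible regression, and `fit_stump` is a bare-bones substitute for a decision-tree library.

```python
import math
import random

def black_box_contrast(s):
    """Hypothetical stand-in for a flexible model's fitted Q(h, 1) - Q(h, 0)."""
    return math.tanh(2.0 * (s - 0.3)) + 0.1 * math.sin(5.0 * s)

def fit_stump(xs, labels):
    """Depth-1 tree 'treat if x > t': choose t minimizing disagreements with labels."""
    pts = sorted(set(xs))
    candidates = [(lo + hi) / 2 for lo, hi in zip(pts, pts[1:])]

    def errors(t):
        return sum(int(x > t) != d for x, d in zip(xs, labels))

    return min(candidates, key=errors)

random.seed(2)
covariate = [random.uniform(-2.0, 2.0) for _ in range(500)]
# Treatment recommended by the complex model at each observed covariate value.
recommended = [int(black_box_contrast(s) > 0) for s in covariate]

threshold = fit_stump(covariate, recommended)
agreement = sum(int(s > threshold) == d
                for s, d in zip(covariate, recommended)) / len(covariate)
print("treat if covariate >", round(threshold, 2), "| agreement:", agreement)
```

Here the complex rule happens to change sign only once, so the stump reproduces it on the sample; in general a small tree only approximates the flexible rule, and the agreement rate measures how much is lost in exchange for a rule that can be stated in one line.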

Connections

  • Implements the backward-induction recursion by stagewise regression; the dynamic-programming analog in reinforcement learning is also called Q-learning (Watkins & Dayan 1992).
  • Contrast with A-learning and Robustness, which models only the treatment contrast and the propensity, gaining double robustness.
  • The stagewise-regression idea generalizes single-stage outcome modeling such as the T-learner.

See Also