Q-learning

Summary

Q-learning (“Q” for quality; Watkins 1989) estimates the optimal dynamic treatment regime by directly modeling the Q-functions and fitting them by backward-recursive regression. At each decision $k$ (from $K$ down to $1$) one posits a parametric model $Q_{k} (\bar{s}_{k}, \bar{a}_{k}; ξ_{k})$, regresses the current “value-to-go” response on it by OLS/WLS, and reads off the optimal rule $\hat{d}_{k}^{opt} = \arg\max_{a_{k}} Q_{k} (\cdot; \hat{ξ}_{k})$. The method is simple and uses familiar regression machinery, but it is consistent only if every Q-function is correctly specified. For $k < K$ the response is an estimated, typically highly nonlinear value function, so linear working models are easily misspecified and errors propagate backward.

Overview

Q-learning is the more transparent of the two methods in Q- and A-learning - Overview: it turns the dynamic-programming recursion into a sequence of regressions. This note gives the backward-fitting algorithm, the linear-model illustration, and the misspecification concern that motivates A-learning and Robustness.

Main Content

Backward-recursive fitting

Q-learning estimating equations (§5.1, Eqs. 26-27)

Posit models $Q_{k} (\bar{s}_{k}, \bar{a}_{k}; ξ_{k})$ for $k = K, \dots, 1$. Fit backward:

Decision $K$ (response $\bar{V}_{(K + 1) i} = Y_{i}$): solve for $\hat{ξ}_{K}$

$\sum_{i = 1}^{n} \frac{\partial Q_{K} (\bar{S}_{K i}, \bar{A}_{K i}; ξ_{K})}{\partial ξ_{K}} Σ_{K}^{- 1} (\bar{S}_{K i}, \bar{A}_{K i}) \{ \bar{V}_{(K + 1) i} - Q_{K} (\bar{S}_{K i}, \bar{A}_{K i}; ξ_{K}) \} = 0,$
where $Σ_{K}$ is a working variance model (constant $Σ_{K} \Rightarrow$ OLS).

Form the fitted value-to-go for the previous stage: $\bar{V}_{K i} = \max_{a_{K} \in Ψ_{K}} Q_{K} (\bar{S}_{K i}, \bar{A}_{(K - 1) i}, a_{K}; \hat{ξ}_{K})$.

Decision $k$ ($k = K - 1, \dots, 1$, response $\bar{V}_{(k + 1) i}$): solve the analogous equation for $\hat{ξ}_{k}$, then form $\bar{V}_{k i} = \max_{a_{k} \in Ψ_{k}} Q_{k} (\bar{S}_{k i}, \bar{A}_{(k - 1) i}, a_{k}; \hat{ξ}_{k})$.

The estimated optimal regime is
$\hat{d}_{Q, 1}^{opt} (s_{1}) = d_{1}^{opt} (s_{1}; \hat{ξ}_{1}), \qquad \hat{d}_{Q, k}^{opt} (\bar{s}_{k}, \bar{a}_{k - 1}) = d_{k}^{opt} (\bar{s}_{k}, \bar{a}_{k - 1}; \hat{ξ}_{k}), \quad k = 2, \dots, K.$ (28)
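The backward recursion above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the source's implementation: treatments are binary, each working model is linear with features $(H_{k}, a_{k} H_{k})$, and the helper names `wls`, `design`, `q_learning`, and `rule`, as well as the simulated data-generating model, are hypothetical choices made for the example. With constant weights, `wls` reduces to OLS, matching the constant-$Σ_{K}$ case.

```python
import numpy as np

def wls(X, y, w=None):
    """Solve the linear-model estimating equation X^T W (y - X xi) = 0."""
    if w is None:
        w = np.ones(len(y))            # constant working variance -> OLS
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def design(H, a):
    """Features for the working model Q(h, a) = H^T beta + a * H^T psi."""
    return np.hstack([H, a[:, None] * H])

def q_learning(histories, actions, y):
    """Backward-recursive fit.

    histories[k] is the n x p_k history matrix available at decision k+1
    (first column all ones); actions[k] is the observed 0/1 treatment.
    Returns the list of fitted parameter vectors (beta_k, psi_k).
    """
    K = len(histories)
    xi = [None] * K
    v = np.asarray(y, dtype=float)     # response at the last decision is Y
    for k in reversed(range(K)):
        H, a = histories[k], actions[k]
        xi[k] = wls(design(H, a), v)
        q0 = design(H, np.zeros_like(a)) @ xi[k]
        q1 = design(H, np.ones_like(a)) @ xi[k]
        v = np.maximum(q0, q1)         # fitted value-to-go for the previous stage
    return xi

def rule(xi_k, H):
    """Estimated optimal rule: treat when the fitted contrast H^T psi is positive."""
    psi = xi_k[H.shape[1]:]
    return (H @ psi > 0).astype(int)

# Simulated two-decision study with randomized binary treatments (assumed model).
rng = np.random.default_rng(0)
n = 5000
s1 = rng.normal(size=n)
a1 = rng.integers(0, 2, size=n)
s2 = 0.5 * s1 + rng.normal(size=n)
a2 = rng.integers(0, 2, size=n)
y = s1 + s2 + 0.5 * a1 + a2 * (1.0 - 2.0 * s2) + rng.normal(size=n)

H1 = np.column_stack([np.ones(n), s1])
H2 = np.column_stack([np.ones(n), s1, a1, s2])
xi = q_learning([H1, H2], [a1, a2], y)
d1 = rule(xi[0], H1)                   # estimated stage-1 decisions
d2 = rule(xi[1], H2)                   # estimated stage-2 decisions
```

In this simulation the stage-2 working model is correctly specified by construction, so `d2` should track the true rule $I(1 - 2 s_{2} > 0)$ closely; the stage-1 fit regresses a maximum of fitted values on a linear model and carries no such guarantee.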

Linear two-decision illustration ( $K = 2$ , binary treatment) (§5.1, Eq. 29)

With $Ψ_{k} = \{0, 1\}$ and history vectors $H_{1} = (1, s_{1}^{⊤})^{⊤}$, $H_{2} = (1, s_{1}^{⊤}, a_{1}, s_{2}^{⊤})^{⊤}$, posit linear models
$Q_{1} (s_{1}, a_{1}; ξ_{1}) = H_{1}^{⊤} β_{1} + a_{1} (H_{1}^{⊤} ψ_{1}), \qquad Q_{2} (\bar{s}_{2}, \bar{a}_{2}; ξ_{2}) = H_{2}^{⊤} β_{2} + a_{2} (H_{2}^{⊤} ψ_{2}),$
with $ξ_{k} = (β_{k}^{⊤}, ψ_{k}^{⊤})^{⊤}$ . Then the optimal rules are threshold rules on the treatment-interaction term:
$d_{2}^{opt} (\bar{s}_{2}, a_{1}; ξ_{2}) = I (H_{2}^{⊤} ψ_{2} > 0), \qquad d_{1}^{opt} (s_{1}; ξ_{1}) = I (H_{1}^{⊤} ψ_{1} > 0).$
The catch: $Q_{2}$ is a standard regression of $Y$ on observed data, but $Q_{1}$ models $E \{ V_{2} (s_{1}, S_{2}, a_{1}) \mid S_{1} = s_{1}, A_{1} = a_{1} \}$, where $V_{2} = \max_{a_{2}} Q_{2}$ is the maximum of a regression function. This conditional expectation is generally highly nonlinear in $(s_{1}, a_{1})$ and is only approximated by the linear $Q_{1}$.

Explicitly, if $S_{2} \mid S_{1}, A_{1}$ is normal, the true $Q_{1} (s_{1}, a_{1})$ involves the normal cdf $Φ$ and is clearly nonlinear (§5.3, Eq. 33), so the posited linear $Q_{1}$ is misspecified. For larger $K$ this incompatibility propagates from decision $K$ down to decision $1$.
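A short sketch of where $Φ$ comes from, under simplifying assumptions that are not taken from the source: conditional on $S_{1} = s_{1}, A_{1} = a_{1}$, the stage-2 contrast $Z = H_{2}^{⊤} ψ_{2}$ is normal with mean $μ = μ (s_{1}, a_{1})$ and constant standard deviation $σ$, and $φ$ denotes the standard normal density. Because $a_{2} \in \{0, 1\}$, the linear $Q_{2}$ gives
$V_{2} = \max_{a_{2}} Q_{2} = H_{2}^{⊤} β_{2} + \max (0, H_{2}^{⊤} ψ_{2}),$
and the standard formula for the mean of the positive part of a normal variable is
$E \{ \max (0, Z) \mid s_{1}, a_{1} \} = μ \, Φ (μ / σ) + σ \, φ (μ / σ).$
Hence
$Q_{1} (s_{1}, a_{1}) = E (H_{2}^{⊤} β_{2} \mid s_{1}, a_{1}) + μ \, Φ (μ / σ) + σ \, φ (μ / σ),$
which is nonlinear in $s_{1}$ whenever $μ$ varies with $s_{1}$, even if $μ$ itself is linear.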

Efficiency, robustness, and remedies

Efficiency. At decision $K$ (response $Y$), taking $Σ_{K} = \operatorname{var} (Y \mid \cdot)$ gives the asymptotically efficient estimator; in practice OLS/WLS is standard. Some authors define Q-learning as OLS fitting (Chakraborty, Murphy & Strecher 2010).
Consistency requires all models correct. Even under sequential randomization, $\hat{d}_{Q}^{opt}$ is generally inconsistent for the true optimal regime if any $Q_{k}$ is misspecified.
Flexible models. Misspecification risk can be reduced with flexible/ML regressions (e.g. SVR; Zhao, Kosorok & Zeng 2009) tuned by cross-validated MSE — at the cost of interpretability (“black box” rules). A compromise is to fit a simple, interpretable model (e.g. a decision tree) to the complex model’s fitted values.
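The interpretability compromise in the last item can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `contrast_hat` is a synthetic stand-in for the fitted treatment contrast from some flexible model (no real model is fitted), there is a single covariate `x`, and `fit_stump` is a hypothetical helper implementing a depth-one regression tree, the simplest possible decision tree.

```python
import numpy as np

def fit_stump(x, target):
    """Depth-one regression tree on one covariate.

    Chooses the split point that minimizes the within-leaf sum of squared
    errors and returns (threshold, left-leaf mean, right-leaf mean).
    """
    order = np.argsort(x)
    xs, ts = x[order], target[order]
    best_sse, best = np.inf, None
    for j in range(1, len(xs)):
        if xs[j] == xs[j - 1]:
            continue                    # no split between tied covariate values
        left, right = ts[:j], ts[j:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best = (0.5 * (xs[j - 1] + xs[j]), left.mean(), right.mean())
    return best

# Synthetic stand-in for a black-box model's fitted contrast Q(x, 1) - Q(x, 0).
x = np.linspace(-1.0, 1.0, 401)
contrast_hat = np.tanh(3.0 * (x - 0.2))

threshold, left_mean, right_mean = fit_stump(x, contrast_hat)

def simple_rule(x_new):
    """Interpretable rule: treat on whichever side of the split has a positive leaf mean."""
    leaf_mean = np.where(x_new > threshold, right_mean, left_mean)
    return (leaf_mean > 0).astype(int)
```

The resulting rule has the form "treat if `x` exceeds `threshold`", which is easy to communicate, at the price of only approximating the decisions implied by the flexible fit.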

Connections

Implements the backward-induction recursion by stagewise regression; the dynamic-programming analog in reinforcement learning is also called Q-learning (Watkins & Dayan 1992).
Contrast with A-learning and Robustness, which models only the treatment contrast $C_{k} = Q_{k} (\cdot, 1) - Q_{k} (\cdot, 0)$ and the propensity, gaining double robustness.
The stagewise-regression idea generalizes single-stage outcome modeling such as the T-learner.
