Optimal Regime via Dynamic Programming

Summary

The optimal dynamic treatment regime $d^{opt}$ is characterized by backward induction (dynamic programming). Working from the last decision $K$ to the first, define Q-functions $Q_{k}(\bar{s}_{k}, \bar{a}_{k}) = E(\text{value-to-go} \mid \text{history})$ measuring the “quality” of taking treatment $a_{k}$ now and behaving optimally thereafter, and value functions $V_{k} = \max_{a_{k}} Q_{k}$. The optimal rule at each stage maximizes the Q-function: $d_{k}^{opt}(\bar{s}_{k}, \bar{a}_{k-1}) = \arg\max_{a_{k}} Q_{k}$. Under consistency, sequential randomization, and positivity these observed-data Q-functions coincide with the potential-outcome definitions, so $d^{opt}$ is identifiable. A key result: the optimal rules do not depend on when a patient presents (decision 1 vs. “midstream”), so a single rule set $d^{opt}$ applies to all patients.

Overview

This note gives the engine shared by Q-learning and A-learning and Robustness: the dynamic-programming characterization of $d^{opt}$. The two methods (Q-learning and A-learning) differ only in how they estimate the pieces of this recursion. The recursion is first written in potential outcomes, then shown to equal an observed-data version under the identification assumptions.

Main Content

Backward induction in potential outcomes

Optimal regime by backward recursion (§3, Eqs. 5-8)

At the last decision $K$, for histories $(\bar{s}_{K}, \bar{a}_{K-1}) \in \Gamma_{K}$:
$d_{K}^{(1) opt}(\bar{s}_{K}, \bar{a}_{K-1}) = \arg\max_{a_{K} \in \Psi_{K}} E\{Y^{*}(\bar{a}_{K-1}, a_{K}) \mid \bar{S}_{K}^{*}(\bar{a}_{K-1}) = \bar{s}_{K}\},$
with value $V_{K}^{(1)} = \max_{a_{K}} E\{Y^{*} \mid \cdot\}$. Then for $k = K-1, \dots, 1$:
$d_{k}^{(1) opt}(\bar{s}_{k}, \bar{a}_{k-1}) = \arg\max_{a_{k} \in \Psi_{k}} E[V_{k+1}^{(1)}\{\bar{s}_{k}, S_{k+1}^{*}(\bar{a}_{k-1}, a_{k}), \bar{a}_{k-1}, a_{k}\} \mid \bar{S}_{k}^{*}(\bar{a}_{k-1}) = \bar{s}_{k}].$
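Here $V_{k}^{(1)}$ for $k < K$ is the maximized expectation at stage $k$, that is, the right-hand side above with $\max$ in place of $\arg\max$. Each stage chooses the treatment maximizing the expected outcome assuming optimal behavior at all later stages, which is the defining property of dynamic programming. (Section A.1 of the supplement proves that this $d^{(1) opt}$ is optimal in the sense of the regime definition.)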
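
As an illustration only, the recursion can be unrolled for $K = 2$ by substituting the last-stage value into the stage-1 rule (here $\bar{a}_{0}$ is empty and $\bar{S}_{1}^{*} = S_{1}$, since baseline information precedes any treatment):
$V_{2}^{(1)}(s_{1}, s_{2}, a_{1}) = \max_{a_{2} \in \Psi_{2}} E\{Y^{*}(a_{1}, a_{2}) \mid S_{1} = s_{1}, S_{2}^{*}(a_{1}) = s_{2}\},$
$d_{1}^{(1) opt}(s_{1}) = \arg\max_{a_{1} \in \Psi_{1}} E[V_{2}^{(1)}\{s_{1}, S_{2}^{*}(a_{1}), a_{1}\} \mid S_{1} = s_{1}].$
The inner maximum sits inside the outer expectation, so the stage-2 choice is allowed to depend on the realized $S_{2}^{*}(a_{1})$. In general this is not the same as picking the fixed pair $(a_{1}, a_{2})$ that maximizes $E\{Y^{*}(a_{1}, a_{2}) \mid S_{1} = s_{1}\}$.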

Observed-data Q- and value functions

Q-functions and value functions (§3, Eqs. 9-14)

In terms of the observed data, define at decision $K$ :
$Q_{K}(\bar{s}_{K}, \bar{a}_{K}) = E(Y \mid \bar{S}_{K} = \bar{s}_{K}, \bar{A}_{K} = \bar{a}_{K}), \quad V_{K}(\bar{s}_{K}, \bar{a}_{K-1}) = \max_{a_{K} \in \Psi_{K}} Q_{K}(\bar{s}_{K}, \bar{a}_{K-1}, a_{K}),$
and recursively for $k = K-1, \dots, 1$:
$Q_{k}(\bar{s}_{k}, \bar{a}_{k}) = E\{V_{k+1}(\bar{s}_{k}, S_{k+1}, \bar{a}_{k}) \mid \bar{S}_{k} = \bar{s}_{k}, \bar{A}_{k} = \bar{a}_{k}\},$
$d_{k}^{opt}(\bar{s}_{k}, \bar{a}_{k-1}) = \arg\max_{a_{k} \in \Psi_{k}} Q_{k}(\bar{s}_{k}, \bar{a}_{k-1}, a_{k}),$
$V_{k}(\bar{s}_{k}, \bar{a}_{k-1}) = \max_{a_{k}} Q_{k}(\bar{s}_{k}, \bar{a}_{k-1}, a_{k}).$
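$Q_{k}$ measures the “quality” of giving treatment $a_{k}$ at decision $k$ given the history and then following the optimal regime thereafter; $V_{k}$ is the “value” of a history assuming optimal decisions from decision $k$ onward. The last-stage rule $d_{K}^{opt}$ is defined in the same way, as the arg-max of $Q_{K}$ over $a_{K} \in \Psi_{K}$.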
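
The recursion above can be traced by hand on a toy problem. The following is a minimal sketch, assuming $K = 2$, binary covariates and treatments, and a fully known data distribution; the functions `p_s2` and `mean_y` and all numbers in them are hypothetical and chosen only for illustration. With real data these conditional expectations are unknown and have to be estimated, which is where Q-learning and A-learning come in.

```python
from itertools import product

ACTIONS = (0, 1)  # assumed feasible set Psi_k, identical at both decisions
STATES = (0, 1)   # assumed: each S_k is binary


def p_s2(s2, s1, a1):
    """Hypothetical P(S_2 = s2 | S_1 = s1, A_1 = a1)."""
    p_one = 0.3 + 0.4 * a1 + 0.2 * s1
    return p_one if s2 == 1 else 1.0 - p_one


def mean_y(s1, s2, a1, a2):
    """Hypothetical E(Y | S_1, S_2, A_1, A_2); for K = 2 this is Q_2."""
    return 1.0 + s1 + 2.0 * s2 + a1 * (0.1 - 0.5 * s1) + a2 * (1.5 - 3.0 * s2)


def backward_induction():
    # Decision K = 2: Q_2 is the outcome regression; maximize over a2.
    Q2, V2, d2 = {}, {}, {}
    for s1, s2, a1 in product(STATES, STATES, ACTIONS):
        for a2 in ACTIONS:
            Q2[(s1, s2, a1, a2)] = mean_y(s1, s2, a1, a2)
        best = max(ACTIONS, key=lambda a2: Q2[(s1, s2, a1, a2)])
        d2[(s1, s2, a1)] = best
        V2[(s1, s2, a1)] = Q2[(s1, s2, a1, best)]

    # Decision 1: Q_1 averages V_2 over S_2 given (S_1, A_1); maximize over a1.
    Q1, V1, d1 = {}, {}, {}
    for s1 in STATES:
        for a1 in ACTIONS:
            Q1[(s1, a1)] = sum(
                p_s2(s2, s1, a1) * V2[(s1, s2, a1)] for s2 in STATES
            )
        best = max(ACTIONS, key=lambda a1: Q1[(s1, a1)])
        d1[(s1,)] = best
        V1[(s1,)] = Q1[(s1, best)]
    return Q1, V1, d1, Q2, V2, d2


Q1, V1, d1, Q2, V2, d2 = backward_induction()
print("d1:", d1)
print("d2:", d2)
```

In this toy setup the stage-2 rule treats only when $S_{2} = 0$, and the stage-1 rule treats only when $S_{1} = 0$. Ties are broken toward the first listed action, which is one of possibly several optimal choices.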

Identification: observed-data optimum equals potential-outcome optimum (§3, Eq. 19)

Under consistency, sequential randomization, and positivity, the conditional distributions of the observed data in (9)–(14) equal those of the potential outcomes in (5)–(8), so
$d_{k}^{(1) opt}(\bar{s}_{k}, \bar{a}_{k-1}) = d_{k}^{opt}(\bar{s}_{k}, \bar{a}_{k-1}), \quad V_{k}^{(1)} = V_{k}, \quad k = 1, \dots, K.$
Thus an optimal regime in the $\Psi$-specific class $D$ can be obtained from the distribution of the observed data. (The optimum need not be unique: any rule selecting an arg-max treatment is optimal.)
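
To see where each assumption enters, it may help to write out the single-decision case $K = 1$, where $S_{1}$ is measured before treatment. This is a sketch for intuition, not the general argument:
$E\{Y^{*}(a_{1}) \mid S_{1} = s_{1}\} = E\{Y^{*}(a_{1}) \mid S_{1} = s_{1}, A_{1} = a_{1}\} = E(Y \mid S_{1} = s_{1}, A_{1} = a_{1}) = Q_{1}(s_{1}, a_{1}).$
The first equality uses sequential randomization ($A_{1}$ is independent of the potential outcomes given $S_{1}$), with positivity ensuring that the event $\{S_{1} = s_{1}, A_{1} = a_{1}\}$ has positive probability for every feasible $a_{1}$. The second uses consistency, $Y = Y^{*}(A_{1})$. Maximizing both ends over $a_{1} \in \Psi_{1}$ gives $V_{1}^{(1)} = V_{1}$ and $d_{1}^{(1) opt} = d_{1}^{opt}$.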

Midstream regimes: the rule set is presentation-invariant

Optimal rules do not depend on when a patient presents (§4, Eq. 25)

A new patient may present “midstream”, that is, immediately prior to decision $\ell > 1$, having received $\ell - 1$ treatments under routine practice. One can define an optimal regime $d^{(\ell) opt}$ starting at decision $\ell$. Under the same assumptions (plus consistency of the presenting covariates), the midstream rules coincide with the original recursion:
$d_{k}^{(\ell) opt}(\bar{s}_{k}, \bar{a}_{k-1}) = d_{k}^{opt}(\bar{s}_{k}, \bar{a}_{k-1}), \quad k = \ell, \dots, K,$
subsuming the $\ell = 1$ case. Consequence: the single rule set $d^{opt} = (d_{1}^{opt}, \dots, d_{K}^{opt})$ is relevant for any patient regardless of when they present; treatment for a midstream patient is $d_{\ell}^{opt}$ evaluated at their history (Robins 2004).

Caveat: this presentation-invariance requires the conditioning sets to carry the same information; it relies on the sequential-randomization and consistency assumptions holding for the routine-practice history.
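
In practice, presentation-invariance means one lookup serves every patient: the decision index is read off the length of the treatment history and the matching rule is applied. The sketch below assumes scalar covariates and $K = 2$; the function `recommend` and the threshold rules in `rules` are hypothetical placeholders, not rules estimated from data.

```python
from typing import Callable, Dict, Sequence

# A rule maps (covariate history, treatment history) to a treatment option.
Rule = Callable[[Sequence[float], Sequence[int]], int]


def recommend(rules: Dict[int, Rule],
              s_bar: Sequence[float],
              a_bar: Sequence[int]) -> int:
    """Apply the rule for decision l to a patient presenting just before it.

    The patient has already received l - 1 treatments, so l = len(a_bar) + 1,
    and covariates S_1, ..., S_l must be available.
    """
    ell = len(a_bar) + 1
    if len(s_bar) != ell:
        raise ValueError("need covariates S_1..S_l and treatments A_1..A_(l-1)")
    if ell not in rules:
        raise KeyError(f"no rule for decision {ell}")
    return rules[ell](s_bar, a_bar)


# Hypothetical rule set for K = 2 (illustrative thresholds only).
rules: Dict[int, Rule] = {
    1: lambda s, a: int(s[0] < 0.5),
    2: lambda s, a: int(s[1] < 0.5 and a[0] == 1),
}

new_patient = recommend(rules, [0.2], [])       # presents at decision 1
midstream = recommend(rules, [0.9, 0.3], [1])   # presents just before decision 2
print(new_patient, midstream)
```

The same `rules` object is used for both patients; only the history passed in differs.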

Connections

This recursion is the shared target of Q-learning (which models the Q-functions directly) and A-learning and Robustness (which models only the contrast $C_{k} = Q_{k}(\cdot, 1) - Q_{k}(\cdot, 0)$ between two treatment options coded 0 and 1, which is sufficient for the arg-max).
Backward induction over Q/value functions is exactly the dynamic-programming / Bellman structure used in reinforcement learning.
Builds on the estimand and assumptions of the Dynamic Treatment Regimes Framework.
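
Why the contrast in the first connection suffices can be seen in one line, assuming two options coded 0 and 1 at decision $k$ (a sketch under that assumption):
$Q_{k}(\bar{s}_{k}, \bar{a}_{k-1}, a_{k}) = Q_{k}(\bar{s}_{k}, \bar{a}_{k-1}, 0) + a_{k}\, C_{k}(\bar{s}_{k}, \bar{a}_{k-1}), \quad a_{k} \in \{0, 1\}.$
The first term does not depend on $a_{k}$, so
$d_{k}^{opt}(\bar{s}_{k}, \bar{a}_{k-1}) = I\{C_{k}(\bar{s}_{k}, \bar{a}_{k-1}) > 0\}, \quad V_{k}(\bar{s}_{k}, \bar{a}_{k-1}) = Q_{k}(\bar{s}_{k}, \bar{a}_{k-1}, 0) + C_{k}(\bar{s}_{k}, \bar{a}_{k-1})\, I\{C_{k}(\bar{s}_{k}, \bar{a}_{k-1}) > 0\}.$
When $C_{k} = 0$ both options attain the maximum, so either choice is optimal.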
