Estimating expected reward using data

I have been talking about E_{\pi} R(S,A,S'), and how it is useful in decision analysis. But how would one estimate E_{\pi} R(S,A,S') from data (for more on this, see e.g. Precup et al. 2000 or Thomas 2015)? Recall that E is expectation, S the pre-tx state, S' the post-tx state, A a tx, \pi a tx strategy, and R a reward.

If you could simply try \pi out, you could take actions according to \pi(a|s) and average the rewards you observe.
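To make this concrete, here is a minimal Python sketch of that on-policy averaging. The generative model (a binary pre-tx covariate S, a binary tx A, a binary post-tx state S', and a reward of 1 for a good outcome) and all of its probabilities are toy assumptions of mine, not anything from real data:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200_000
S = rng.integers(0, 2, n)                            # S ~ p(s): a binary pre-tx covariate
p_a1 = np.where(S == 1, 0.8, 0.3)                    # pi(A=1|S): treat more often when S=1
A = (rng.random(n) < p_a1).astype(int)               # A ~ pi(.|S)
p_good = np.where(A == 1, 0.7, 0.4)                  # p(S'=1|A,S): treatment helps (toy numbers)
S_next = (rng.random(n) < p_good).astype(int)        # S' ~ p(s'|a,s)
R = S_next.astype(float)                             # R(S,A,S') = 1 if the outcome is good

print(R.mean())                                      # Monte Carlo estimate of E_pi R(S,A,S')
```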

Unfortunately, in medicine, we often cannot just try out a new treatment strategy. Hence, we need to estimate E_{\pi} R(S,A,S') from data that was generated under \pi_0 (i.e., the standard of care).

As before (see here for more on this equation), write

E_\pi R(S,A,S') = \int_s \int_{s'} \sum_a R(s,a,s') p(s'|a,s)\pi(a|s)p(s)ds'ds.

Now, multiply the above expression by 1=\pi_0(a|s)/\pi_0(a|s), where \pi_0 is the standard of care treatment strategy.

E_\pi R(S,A,S') = \int_s \int_{s'} \sum_a R(s,a,s') p(s'|a,s)\pi(a|s)\frac{\pi_0(a|s)}{\pi_0(a|s)}p(s)ds'ds

Rearrange slightly: grouping the ratio with \pi(a|s), E_\pi R(S,A,S') can be written
\int_s \int_{s'} \sum_a R(s,a,s') p(s'|a,s)\frac{\pi(a|s)}{\pi_0(a|s)}\pi_0(a|s)p(s)ds'ds, which is exactly E_{\pi_0} \frac{\pi(A|S)}{\pi_0(A|S)}R(S,A,S')!
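As a quick sanity check of this change-of-measure step, the identity can be verified by exact enumeration on a tiny discrete model (everything binary; the probabilities below are again my own toy choices, not anything from the post):

```python
import itertools

# toy probabilities (my own choices): S, A, S' all binary
p_s = {0: 0.5, 1: 0.5}                               # p(s)
pi1 = {0: 0.3, 1: 0.8}                               # pi(A=1|S=s), the policy to evaluate
pi0_1 = {0: 0.5, 1: 0.5}                             # pi_0(A=1|S=s), the standard of care

def prob(p1_by_s, a, s):                             # P(A=a|S=s) from a table of P(A=1|S=s)
    return p1_by_s[s] if a == 1 else 1 - p1_by_s[s]

def p_next(s2, a, s):                                # p(S'=s2|a,s): treatment raises P(S'=1)
    p1 = 0.4 + 0.3 * a
    return p1 if s2 == 1 else 1 - p1

def reward(s, a, s2):                                # R(S,A,S') = 1 for a good outcome
    return float(s2)

lhs = rhs = 0.0
for s, a, s2 in itertools.product([0, 1], repeat=3):
    # left: E_pi R(S,A,S') computed directly under pi
    lhs += reward(s, a, s2) * p_next(s2, a, s) * prob(pi1, a, s) * p_s[s]
    # right: the same expectation under pi_0, reweighted by pi/pi_0
    w = prob(pi1, a, s) / prob(pi0_1, a, s)
    rhs += w * reward(s, a, s2) * p_next(s2, a, s) * prob(pi0_1, a, s) * p_s[s]

print(lhs, rhs)                                      # the two agree, as the algebra above says
```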

Given \pi_n, an estimator of \pi_0, the last expression can be estimated from data (A_1,S_1,S'_1,\dots,A_n,S_n,S'_n) that was generated under \pi_0 using the inverse probability weighting (IPW) estimator V_n:

V_n(\pi)=\frac{1}{n}\sum_i \frac{\pi(A_i|S_i)}{\pi_n(A_i|S_i)}R(S_i,A_i,S'_i).
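Here is a minimal sketch of computing V_n on simulated observational data, continuing the same toy model (my own assumed numbers). The propensity estimator \pi_n is taken to be the empirical treatment frequency within each level of S, which is just one simple choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# observational data generated under the standard of care pi_0 (toy model, my own numbers)
n = 200_000
S = rng.integers(0, 2, n)
A = (rng.random(n) < 0.5).astype(int)                          # pi_0 treats half the time
S_next = (rng.random(n) < np.where(A == 1, 0.7, 0.4)).astype(int)
R = S_next.astype(float)

# pi_n: estimate pi_0(A=1|S=s) as the empirical treatment rate within each stratum of S
pi_n_1 = np.array([A[S == s].mean() for s in (0, 1)])
pi_n = np.where(A == 1, pi_n_1[S], 1 - pi_n_1[S])              # pi_n(A_i|S_i)

# the target policy pi we want to evaluate
pi_1 = np.where(S == 1, 0.8, 0.3)                              # pi(A=1|S_i)
pi_a = np.where(A == 1, pi_1, 1 - pi_1)                        # pi(A_i|S_i)

V_n = np.mean(pi_a / pi_n * R)                                 # the IPW estimator
print(V_n)                                                     # close to the on-policy value above
```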

This is a simplified setting, but it is still somewhat magical. It doesn’t come without assumptions, though.

For example, note that if \pi_n(A_i|S_i)=0 for some observed A_i and S_i, we are in trouble, because we cannot divide by zero. More fundamentally, if \pi_0(a|s)=0 for some a and s at which \pi(a|s)>0, the change-of-measure step above breaks down; this is the positivity (or overlap) assumption.

Note also that to estimate V_n well, we need to have collected all the covariates S that influence both the treatment A and the outcome S' (no unmeasured confounders), since A and S' determine R(S,A,S'). I will try to give more rationale for this at some point.

If one accepts these assumptions (which is a substantial “if”), then one can obtain an expected utility-maximizing treatment policy, \pi^* = \arg\max_{\pi} V_n(\pi).
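In a toy setting the arg-max can be approximated by evaluating V_n over a small candidate set of policies and picking the best. A minimal sketch under the same assumed model, treating \pi_0 as known for simplicity:

```python
import numpy as np

rng = np.random.default_rng(2)

# observational data under pi_0 (toy model, my own numbers)
n = 200_000
S = rng.integers(0, 2, n)
A = rng.integers(0, 2, n)                                      # pi_0(A=1|S) = 0.5 everywhere
S_next = (rng.random(n) < np.where(A == 1, 0.7, 0.4)).astype(int)
R = S_next.astype(float)
pi0 = np.full(n, 0.5)                                          # pi_0(A_i|S_i), assumed known here

def V(pi1_by_s):
    """IPW value V_n of the policy with pi(A=1|S=s) = pi1_by_s[s]."""
    p1 = np.asarray(pi1_by_s)[S]
    w = np.where(A == 1, p1, 1 - p1) / pi0
    return np.mean(w * R)

# candidate policies, written as (pi(A=1|S=0), pi(A=1|S=1))
candidates = [(0.0, 0.0), (0.3, 0.8), (0.0, 1.0), (1.0, 1.0)]
best = max(candidates, key=V)
print(best, V(best))                                           # always-treat wins in this toy model
```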
