Estimating expected reward using data

I have been talking about E_{\pi} R(S,A,S'), and how it is useful in decision analysis. But how would one estimate E_{\pi} R(S,A,S') from data (for more on this, see e.g. Precup et al. 2000 or Thomas 2015)? Recall that E is expectation, S the pre-tx state, S' the post-tx state, A a tx, \pi a tx strategy, and R a reward.

If you could simply try \pi out, you could take actions according to \pi(a|s) and average the rewards you observe.
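To make this concrete, here is a minimal Python sketch of that on-policy averaging. The generative model (a binary pre-tx covariate S, a binary tx A, a binary post-tx state S', and a reward of 1 for a good outcome) and all of its probabilities are toy assumptions of mine, not anything from real data:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200_000
S = rng.integers(0, 2, n)                            # S ~ p(s): a binary pre-tx covariate
p_a1 = np.where(S == 1, 0.8, 0.3)                    # pi(A=1|S): treat more often when S=1
A = (rng.random(n) < p_a1).astype(int)               # A ~ pi(.|S)
p_good = np.where(A == 1, 0.7, 0.4)                  # p(S'=1|A,S): treatment helps (toy numbers)
S_next = (rng.random(n) < p_good).astype(int)        # S' ~ p(s'|a,s)
R = S_next.astype(float)                             # R(S,A,S') = 1 if the outcome is good

print(R.mean())                                      # Monte Carlo estimate of E_pi R(S,A,S')
```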

Unfortunately, in medicine, we often cannot just try out a new treatment strategy. Hence, we need to estimate E_{\pi} R(S,A,S') from data that was generated under \pi_0 (i.e., the standard of care).

As before (see here for more on this equation), write

E_\pi R(S,A,S') = \int_s \int_{s'} \sum_a R(s,a,s') p(s'|a,s)\pi(a|s)p(s)ds'ds.

Now, multiply the above expression by 1=\pi_0(a|s)/\pi_0(a|s), where \pi_0 is the standard of care treatment strategy.

E_\pi R(S,A,S') = \int_s \int_{s'} \sum_a R(s,a,s') p(s'|a,s)\pi(a|s)\frac{\pi_0(a|s)}{\pi_0(a|s)}p(s)ds'ds

Rearrange slightly: grouping the ratio with \pi(a|s), E_\pi R(S,A,S') can be written
\int_s \int_{s'} \sum_a R(s,a,s') p(s'|a,s)\frac{\pi(a|s)}{\pi_0(a|s)}\pi_0(a|s)p(s)ds'ds, which is exactly E_{\pi_0} \frac{\pi(A|S)}{\pi_0(A|S)}R(S,A,S')!
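As a quick sanity check of this change-of-measure step, the identity can be verified by exact enumeration on a tiny discrete model (everything binary; the probabilities below are again my own toy choices, not anything from the post):

```python
import itertools

# toy probabilities (my own choices): S, A, S' all binary
p_s = {0: 0.5, 1: 0.5}                               # p(s)
pi1 = {0: 0.3, 1: 0.8}                               # pi(A=1|S=s), the policy to evaluate
pi0_1 = {0: 0.5, 1: 0.5}                             # pi_0(A=1|S=s), the standard of care

def prob(p1_by_s, a, s):                             # P(A=a|S=s) from a table of P(A=1|S=s)
    return p1_by_s[s] if a == 1 else 1 - p1_by_s[s]

def p_next(s2, a, s):                                # p(S'=s2|a,s): treatment raises P(S'=1)
    p1 = 0.4 + 0.3 * a
    return p1 if s2 == 1 else 1 - p1

def reward(s, a, s2):                                # R(S,A,S') = 1 for a good outcome
    return float(s2)

lhs = rhs = 0.0
for s, a, s2 in itertools.product([0, 1], repeat=3):
    # left: E_pi R(S,A,S') computed directly under pi
    lhs += reward(s, a, s2) * p_next(s2, a, s) * prob(pi1, a, s) * p_s[s]
    # right: the same expectation under pi_0, reweighted by pi/pi_0
    w = prob(pi1, a, s) / prob(pi0_1, a, s)
    rhs += w * reward(s, a, s2) * p_next(s2, a, s) * prob(pi0_1, a, s) * p_s[s]

print(lhs, rhs)                                      # the two agree, as the algebra above says
```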

Given \pi_n, an estimator of \pi_0, the last expression can be estimated from data (A_1,S_1,S'_1,\dots,A_n,S_n,S'_n) that was generated under \pi_0 using the inverse probability weighting (IPW) estimator V_n:

V_n(\pi)=\frac{1}{n}\sum_i \frac{\pi(A_i|S_i)}{\pi_n(A_i|S_i)}R(S_i,A_i,S'_i).
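Here is a minimal sketch of computing V_n on simulated observational data, continuing the same toy model (my own assumed numbers). The propensity estimator \pi_n is taken to be the empirical treatment frequency within each level of S, which is just one simple choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# observational data generated under the standard of care pi_0 (toy model, my own numbers)
n = 200_000
S = rng.integers(0, 2, n)
A = (rng.random(n) < 0.5).astype(int)                          # pi_0 treats half the time
S_next = (rng.random(n) < np.where(A == 1, 0.7, 0.4)).astype(int)
R = S_next.astype(float)

# pi_n: estimate pi_0(A=1|S=s) as the empirical treatment rate within each stratum of S
pi_n_1 = np.array([A[S == s].mean() for s in (0, 1)])
pi_n = np.where(A == 1, pi_n_1[S], 1 - pi_n_1[S])              # pi_n(A_i|S_i)

# the target policy pi we want to evaluate
pi_1 = np.where(S == 1, 0.8, 0.3)                              # pi(A=1|S_i)
pi_a = np.where(A == 1, pi_1, 1 - pi_1)                        # pi(A_i|S_i)

V_n = np.mean(pi_a / pi_n * R)                                 # the IPW estimator
print(V_n)                                                     # close to the on-policy value above
```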

This is a simplified setting, but it is still somewhat magical. It doesn’t come without assumptions, though.

For example, note that if \pi_n(A_i|S_i)=0 for some observed A_i and S_i, we are in trouble, because we cannot divide by zero. More fundamentally, if \pi_0(a|s)=0 for some a and s at which \pi(a|s)>0, the change-of-measure step above breaks down; this is the positivity (or overlap) assumption.

Note also that to estimate V_n well, we need to have collected all the covariates S that influence both the treatment A and the outcome S' (no unmeasured confounders), since A and S' determine R(S,A,S'). I will try to give more rationale for this at some point.

If one accepts these assumptions (which is a substantial “if”), then one can obtain an expected utility-maximizing treatment policy, \pi^* = \arg\max_{\pi} V_n(\pi).
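In a toy setting the arg-max can be approximated by evaluating V_n over a small candidate set of policies and picking the best. A minimal sketch under the same assumed model, treating \pi_0 as known for simplicity:

```python
import numpy as np

rng = np.random.default_rng(2)

# observational data under pi_0 (toy model, my own numbers)
n = 200_000
S = rng.integers(0, 2, n)
A = rng.integers(0, 2, n)                                      # pi_0(A=1|S) = 0.5 everywhere
S_next = (rng.random(n) < np.where(A == 1, 0.7, 0.4)).astype(int)
R = S_next.astype(float)
pi0 = np.full(n, 0.5)                                          # pi_0(A_i|S_i), assumed known here

def V(pi1_by_s):
    """IPW value V_n of the policy with pi(A=1|S=s) = pi1_by_s[s]."""
    p1 = np.asarray(pi1_by_s)[S]
    w = np.where(A == 1, p1, 1 - p1) / pi0
    return np.mean(w * R)

# candidate policies, written as (pi(A=1|S=0), pi(A=1|S=1))
candidates = [(0.0, 0.0), (0.3, 0.8), (0.0, 1.0), (1.0, 1.0)]
best = max(candidates, key=V)
print(best, V(best))                                           # always-treat wins in this toy model
```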
