- Primarily concerned with schizophrenia treatment, but discusses a number of factors relevant to other types of treatment:
- No controlled exploration
- Limited data
- Partial observability
- Importance of understanding level of uncertainty

- Corpus covers 1460 patients in a two-stage clinical trial
- Specifically mention issues of drug resistance.
- Defines state as patient’s history
- Using SMART trials: sequential multiple assignment randomized trials
- Using finite horizon MDPs (since it was a 2-stage study, the horizon is length 2)
- The study lasted 18 months, with data recorded every 3 months. 20 demographic variables and 30 variables relevant to treatment progress
- Talk about patient dropout as problematic in two ways: it reduces the amount of data, which increases variance, and it introduces bias.
- Removing data from trial dropouts can also introduce another type of bias, so they left it in
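To make the 2-stage finite-horizon setup concrete, here is a minimal sketch of backward induction over two stages. It is a tabular stand-in and not the linear method the paper uses; the trajectory format, the function names, and the toy numbers are all made up for illustration.

```python
from collections import defaultdict

def mean_by_key(pairs):
    """Average the values that share a key: {(state, action): mean value}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in pairs:
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def two_stage_q(trajectories):
    """trajectories: (s1, a1, s2, a2, final_outcome) tuples (hypothetical format)."""
    # Stage 2: Q2(s2, a2) is the mean final outcome observed after (s2, a2).
    q2 = mean_by_key(((s2, a2), r) for _, _, s2, a2, r in trajectories)

    def v2(state):
        # Value of a stage-2 state under the best observed stage-2 action.
        return max(v for (s, _), v in q2.items() if s == state)

    # Stage 1: Q1(s1, a1) is the mean of V2 over the stage-2 states reached.
    q1 = mean_by_key(((s1, a1), v2(s2)) for s1, a1, s2, _, _ in trajectories)
    return q1, q2

# Made-up toy data: two medications "A" and "B", outcome observed at the end.
toy = [
    ("start", "A", "improved", "A", 1.0),
    ("start", "A", "improved", "B", 3.0),
    ("start", "B", "not_improved", "A", 0.0),
    ("start", "B", "not_improved", "B", 2.0),
]
q1, q2 = two_stage_q(toy)
```

With a horizon of 2, the whole computation is one backward pass: fit stage 2 first, then use its best-action value as the target for stage 1.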

- On average symptoms improved over the course of the study, but on a case-by-case basis there was a very large amount of noise/variance
- An additional issue in this particular case is that all the medications were expected to perform similarly, so teasing them apart would be tricky
- Discuss use of multiple imputation where (I think) data that is missing is filled in probabilistically (repeatedly) and then that completed data is treated as the true data
- The exact method of multiple imputation used is outlined
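As a rough illustration of the general idea of multiple imputation (not the specific imputation model outlined in the paper), the sketch below fills each missing value with a random draw from the observed values, repeats this several times, and averages the per-dataset estimates. The function names, the draw-from-observed rule, and the scores are assumptions made for this example.

```python
import random
import statistics

def multiply_impute(values, n_imputations=5, seed=0):
    """Return n_imputations completed copies of values (None marks missing).

    Each missing entry is filled with a random draw from the observed
    entries. This simple rule is only an illustrative stand-in.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [
        [v if v is not None else rng.choice(observed) for v in values]
        for _ in range(n_imputations)
    ]

def pooled_mean(completed_datasets):
    """Average the per-dataset means to get a single pooled point estimate."""
    return statistics.mean(statistics.mean(d) for d in completed_datasets)

# Made-up symptom scores with two missing visits.
scores = [70.0, None, 64.0, 81.0, None, 75.0]
datasets = multiply_impute(scores, n_imputations=20)
estimate = pooled_mean(datasets)
```

Each completed dataset is analyzed as if it were fully observed, and the spread of estimates across the completed datasets reflects the extra uncertainty caused by the missing values.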

- They use LSPI – want to use linear method because it is simple and there is concern related to use of high variance training data
- Not sure I understand their method of voting for treatment. It looks like they base results on all possible 2-permutations of treatments (saying the goodness of treatment *i* is the average of doing *i* and then each of the other possible treatments). Why not say *i* is as good as it is when followed by the *best* possible treatment?
- Perhaps it could be argued it’s intended to reduce the impact of variance, but that isn’t stated.
- They mention choosing to do “double bootstrapping” to reduce variance, but not as an argument for the method as a whole
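The difference between the two scoring rules in the question above can be shown on a toy table. This follows my reading of the voting scheme, which may not match the paper exactly; the treatment names and all numbers are made up.

```python
# Hypothetical outcome estimates: value[first][second] (made-up numbers).
value = {
    "A": {"B": 1.0, "C": 5.0},
    "B": {"A": 3.5, "C": 3.5},
    "C": {"A": 2.0, "B": 2.0},
}

def score_by_average(first):
    """Score a first treatment by averaging over every possible follow-up."""
    followups = value[first].values()
    return sum(followups) / len(followups)

def score_by_best(first):
    """Score a first treatment by its best possible follow-up."""
    return max(value[first].values())

best_on_average = max(value, key=score_by_average)       # "B" (3.5 vs 3.0 vs 2.0)
best_with_best_followup = max(value, key=score_by_best)  # "A" (5.0 vs 3.5 vs 2.0)
```

In this toy case the two rules disagree: averaging favors the treatment whose follow-ups are consistently decent, while taking the max favors the treatment with one very good follow-up, which is also the estimate most exposed to noise.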

- The method of estimating confidence intervals is different from the traditional Hoeffding bounds
- Use bootstrap resampling methods
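For reference, a plain percentile bootstrap interval looks like the sketch below. This is the textbook version, not the adaptive confidence interval method the paper develops; the function name, defaults, and data are assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on n_resamples datasets drawn with replacement.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up per-patient changes in symptom score.
symptom_change = [-12.0, 3.0, -5.0, -20.0, 8.0, -1.0, -9.0, 4.0, -15.0, -2.0]
lower, upper = bootstrap_ci(symptom_change)
```

Unlike a Hoeffding-style bound, this makes no use of a known range for the outcome; the interval width comes entirely from the resampled spread of the statistic.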

- In all cases there is at least some overlap for the suggestion of optimal treatment
- Mention that SMART studies are underway for treatment of autism, substance abuse, and ADD
- “While this foundation is a good start, using RL methods to optimize treatment policies is not as simple as applying off-the-shelf RL methods. Though the planning horizon is very short (only 2 stages in our particular application) and the exploration policy is fixed, upon closer inspection, a number of previously unexplored challenges arise when tackling problems in this domain. In particular, our case study highlights issues such as pervasive missing data, the need to handle a high-dimensional and variable state space and the need to communicate the evidence for a recommended action and estimate confidence in the actions selected by the learned policy.”
- “The adaptive confidence interval method presented here assumes that the linear approximation provides a high quality approximation to the optimal Q-function. The adaptive confidence interval method has not yet been generalized for use with a non-linear approximation to the Q-function such as, for example, trees or a nearest neighbor estimators.”

Notes on what type of data the algorithm requires/consumes:

- Very short horizon (2 step)
- Small action set (5 different medications).
- Initial action/medication used was randomized

- Data was recorded every three months
- 20 demographic variables recorded at start
- 30 variables measured over time
- Examples are:
- Symptom levels
- Quality of life
- Side effect burden
- Medication compliance

- They differentiate between two types of state variables: the normal ones as we usually think of them, and other state variables that “can aid in the accurate estimation of the Q-function, but that are not needed to inform the policy because their influence on the Q-function is the same for all actions.”
- They argue that including these types of state variables helps make the problem more Markovian, even though it is still highly partially observable by nature.
- There were 5 of these types of variables in this study, for example:
- Binary variable related to a particular side effect
- Whether the patient was hospitalized in the past 3 months
- If so, where the patient was treated (private, state, VA, etc.)
- Length of time in study prior to current stage
- Previous treatment
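One way to picture the two types of state variables is a linear Q-function in which the auxiliary variables get a single set of weights shared across actions, while the policy-relevant variables get a separate block of weights per action. This is a sketch under that assumption, not the paper's exact feature construction; all names and weights are hypothetical.

```python
def q_features(policy_vars, aux_vars, action, n_actions):
    """Build the feature vector for Q(state, action).

    aux_vars enter once, with weights shared by all actions.
    policy_vars (plus an intercept) go into the block for the chosen action;
    the blocks for every other action are zero.
    """
    feats = list(aux_vars)
    width = 1 + len(policy_vars)
    for a in range(n_actions):
        if a == action:
            feats.extend([1.0, *policy_vars])
        else:
            feats.extend([0.0] * width)
    return feats

def q_value(weights, feats):
    return sum(w * f for w, f in zip(weights, feats))

def greedy_action(weights, policy_vars, aux_vars, n_actions):
    return max(
        range(n_actions),
        key=lambda a: q_value(weights, q_features(policy_vars, aux_vars, a, n_actions)),
    )

# Hypothetical weights: 2 auxiliary weights, then (intercept, slope) per action.
weights = [0.5, -1.0, 0.0, 1.0, 0.2, 0.5]
choice_1 = greedy_action(weights, [2.0], [1.0, 0.0], 2)
choice_2 = greedy_action(weights, [2.0], [0.0, 5.0], 2)
```

Because the auxiliary terms add the same amount to every action's Q-value, changing them shifts all the Q-values together and leaves the greedy choice unchanged, which matches the quoted description of variables that help estimate the Q-function without informing the policy.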