Category Archives: POMDP

The Role of the Ventromedial Prefrontal Cortex in Abstract State-Based Inference During Decision Making in Humans. Hampton, Bossaerts, O’Doherty. Journal of Neuroscience 2006.

  1. The paper’s finding is that activity in ventromedial prefrontal cortex (vmPFC) is more consistent with hierarchical than with flat RL
  2. “These results suggest that brain regions, such as vmPFC, use an abstract model of task structure to guide behavioral choice, computations that may underlie the human capacity for complex social interactions and abstract strategizing.”
  3. In the task studied, the task flips in the middle so “…contingencies will reverse.”
    1. In particular, there are two actions that can be selected from, both with stochastic payoff.  Once the better action is selected 4 times straight, the payoff distributions switch (so the bad arm becomes the good one)
    2. This is a POMDP, where people need to infer the hidden state dynamics in order to play well
  4. PFC is tasked with encoding high-order structure, and is associated with higher cognitive functions such as working memory, planning, and decision making, and also seems to be involved in encoding abstract rules
  5. Goal is to see if PFC activity “… would correlate better with an abstract state-based decision algorithm than with simple RL.” <I don’t yet understand what the distinction between these two is, but I’m sure we’ll get there>
  6. Of the RL models tested, Q-learning with softmax action selection had the best fit to subject data <I’m not sure whether that means their methodology was bad/lacking in RL algos, or if people are just terrible>
  7. The “abstract state-based model” is a Bayesian hidden state Markov model
  8. The way the model is set up is that: “The decision to switch is implemented on the basis of the posterior probability that the last choice was incorrect.”
    1. “The state-based model predicts the subjects’ actual choice behavior (whether to switch or not) with an accuracy of  92 +/- 2%”
  9. Subjects made the right choice 61 +/- 2% of the time <this isn’t so much better than chance>
    1. “This is also close to the optimal performance of the state-based model that was 64% (using the actual task parameters).” <Is it that hard?>
  10. <I don’t really like their terminology (they call the “prior” the posterior before the reward signal and the “posterior” the posterior after the reward signal)>
  11. <Anyway,> “The prior correct signal” was found to correlate with medial PFC (mPFC), adjacent OFC, and amygdala
  12. The difference between what they call the prior/posterior (the reward prediction error) led to signal in the ventral striatum, as is found elsewhere
  13. “The prior correct signal from the state-based model is almost identical to the expected reward signal from the RL model.  Nevertheless, our paradigm permits sharp discrimination between the two models.  The predictions of the two models differ immediately after a switch in the subjects’ action choice.”
    1. Basically, the MDP planner (non-POMDP) just uses the average rewards of the two actions, whereas the POMDP planner tries to differentiate action quality based on the hidden state
  14. At p < 0.01, “…abstract state-based decision making may be especially localized to the vmPFC.”
  15. “The critical distinction between the state-based inference model and standard RL is what happens to the expected value of the newly chosen stimulus after subjects switch.  According to standard RL, the expected value of the new choice should be low, because that was the value it had when the subject had previously stopped selecting it (usually after receiving monetary losses on that stimulus).  In contrast, the state-based algorithm predicts that the expected value for the newly chosen action should be high, because unlike standard RL, it incorporates the knowledge that when one action is low in value, the other is high.”
    1. “… the expected value signal in the vmPFC jumps up even before a reward is delivered on the newly chosen action… updating seems to occur using previous knowledge of the task structure.”
  16. “The final decision whether to switch or stay was associated with activity in the anterior cingulate cortex and anterior insula, consistent with previous reports of a role for these regions in behavioral control (…).  These regions are in close proximity to areas that were significantly correlated with the prior probability that the current choice was incorrect as provided by the decision model.  A plausible interpretation of these findings is that the anterior insular and anterior cingulate cortex may actually be involved in using information about the inferred choice probabilities to compute the decision itself.”
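To make the switching rule concrete, here’s a minimal sketch of the kind of hidden-state Bayesian update the state-based model performs. The parameter values (reversal probability, payoff probabilities) and function names are illustrative, not taken from the paper.

```python
# Hidden state: "the arm I chose is currently the good one" vs not.
# Illustrative parameters, not the paper's fitted values.
P_REVERSE = 0.1   # chance the contingencies flip between trials
P_WIN_GOOD = 0.7  # payoff probability of the good arm
P_WIN_BAD = 0.3   # payoff probability of the bad arm

def update_belief(b_good, won):
    """b_good = P(my last choice was the good arm) before this outcome.

    Bayes-update on the win/loss, then apply the reversal transition
    to get the prior for the next trial."""
    lik_good = P_WIN_GOOD if won else 1.0 - P_WIN_GOOD
    lik_bad = P_WIN_BAD if won else 1.0 - P_WIN_BAD
    post = b_good * lik_good / (b_good * lik_good + (1.0 - b_good) * lik_bad)
    # The task may reverse before the next choice.
    return post * (1.0 - P_REVERSE) + (1.0 - post) * P_REVERSE

def should_switch(b_good, threshold=0.5):
    # Switch when the posterior that the last choice was incorrect is high.
    return (1.0 - b_good) > threshold
```

A win pushes belief toward the current arm being good and a loss pushes it away, which is exactly what lets the model jump the expected value of the other arm up immediately after inferring a reversal.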

Monte-Carlo Planning in Large POMDPs. David Silver, Joel Veness. Nips 2010.

  1. Deals with how to get MCTS (specifically UCT) working in POMDPs
  2. Maintains belief state with a particle filter
  3. The tree maintained by UCT is modified slightly
    1. Counts are on history (actions and observations over time).
    2. History maps to the belief state
  4. The method used by UCT here is the unofficial version that builds an additional node onto the tree after each iteration
  5. Discuss convergence of UCT and this version for POMDPs, but don’t mention that sometimes actual learning is intractable
  6. The empirical results are impressive.  In smaller domains more exact methods slightly outperform PO-UCT, but PO-UCT is extremely cheap to run in comparison (multiple orders of magnitude).
  7. Some of the details (especially related to the particle filter) are omitted, but the source code was released
  8. Seems like there is very little literature on POMDPs and MCTS.  The stuff here is nice, although more details on the particle filter would be helpful
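Since the particle-filter details are what’s missing, here’s my sketch of the belief update as I understand it: after executing an action and receiving a real observation, push sampled particles through a generative model and keep the successor states whose simulated observation matches (rejection sampling). The name `generative_model` and the interface are my assumptions, not the paper’s released code.

```python
import random

def update_particles(particles, action, obs, generative_model, n_target):
    """Rejection-sampling belief update for a PO-UCT-style particle filter.

    generative_model(s, a) -> (s', o, r) is a black-box simulator.
    Keeps sampling particles until n_target successors match the real obs."""
    new_particles = []
    while len(new_particles) < n_target:
        s = random.choice(particles)            # sample from current belief
        s2, o, _r = generative_model(s, action)
        if o == obs:                            # accept only matching obs
            new_particles.append(s2)
    return new_particles
```

Note this loops forever if no particle can ever generate the real observation (particle deprivation); the paper deals with this via particle reinvigoration, which is among the details I’d like to see spelled out.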

Learning Methods for Sequential Decision Making with Imperfect Representations. Thesis by Shivaram c.

Basic idea is dealing with the issue of how to do RL in continuous spaces, when it is difficult to represent what we would really like to.

Chapter 1:

  1. Compares value-based and policy search (PS) methods (Sarsa(λ) and CMA-ES respectively; CMA-ES seems not totally different from cross-entropy)
  2. Discusses using Sarsa to seed policies for PS methods

Chapter 2:

  1. State aliasing is when two actually distinct states are perceived as the same (POMDPs).  This can actually be helpful when generalization between the two states is useful
  2. Really solving POMDPs isn’t possible in large domains, but in some domains the partial observability doesn’t kill you and you might as well act based on the observations directly
    1. Belief states are basically impossible to deal with in continuous states anyway
  3. Discusses PS on Keepaway, but a version that uses a function approximator for each action (probably estimating the value of each) and then picks the action with the highest output
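That per-action-FA selection scheme is simple enough to sketch; weights, features, and the linear form are my illustrative assumptions, not the thesis’s exact setup.

```python
def select_action(weights_per_action, features):
    """One linear function approximator per action: score the current
    feature vector under each action's weights, pick the argmax."""
    scores = [sum(w * f for w, f in zip(weights, features))
              for weights in weights_per_action]
    return max(range(len(scores)), key=scores.__getitem__)
```

With PS, the weight vectors themselves are the policy parameters being searched over, rather than being fit to value targets.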

Chapter 3:

  1. Sometimes value-based methods are more effective than PS and vice versa; it depends on the setting.  These settings are investigated experimentally in this chapter
  2. Discusses that in supervised learning, we have fairly good ideas as to what off-the shelf methods work well on what “type”  of problem, but in RL this body of knowledge is less well developed
  3. Part of the goal of the empirical results here is to tease apart which qualities of problems work well with which RL algorithms
  4. They tune each algorithm to each particular instance of each domain
  5. Experiments are in gridworlds…. but the point is they add a layer of POMDPness to it, where (nearby) states can get conflated
    1. I don’t exactly grasp the method of PO-alizing it
    2. But there is discussion of applications to larger domains such as Tetris and Keepaway
  6. Their Sarsa(λ) implementation uses CMACs with linear value estimators
  7. Admit that exploration isn’t very important in this experimental setup
  8. Agent always has only two actions (north or east)
  9. …dropping it mid ch 3 (3.3.2) as it looks a bit orthogonal to my research.  I’m a bit interested in the parts about Tetris and Keepaway, which I might look at later.

Informing sequential clinical decision-making through reinforcement learning: an empirical study. Susan M. Shortreed · Eric Laber · Daniel J. Lizotte · T. Scott Stroup · Joelle Pineau · Susan A. Murphy. Machine Learning 2010

  1. Primarily concerned with Schizophrenia treatment, but discusses a number of factors relevant to other types of treatment:
    1. No controlled exploration
    2. Limited data
    3. Partial observability
    4. Importance of understanding level of uncertainty
  2. Corpus covers 1460 patients in a two stage clinical trial
  3. Specifically mention issues of drug resistance.
  4. Defines state as patient’s history
  5. Using SMART trials: sequential multiple assignment randomized trials
  6. Using finite horizon MDPs (since it was a 2-stage study, the horizon is length 2)
  7. The study lasted 18 months, with data recorded every 3 months.  20 demographic variables and 30 variables relevant to treatment progress
  8. Talk about patient dropout as problematic in two ways: reduces data therefore increasing variance, and introduces bias.
    1. Removing data from trial dropouts can also introduce another type of bias, so they left it in
  9. On average symptoms improved over the course of the study, but on a case-by-case basis there was a very large amount of noise/variance
  10. An additional issue in this particular case is that all the medications were expected to perform similarly, so teasing them apart would be tricky
  11. Discuss the use of multiple imputation, where (I think) missing data is filled in probabilistically (repeatedly) and then the completed data is treated as the true data
    1. The exact method of multiple imputation used is outlined
  12. They use LSPI – want to use linear method because it is simple and there is concern related to use of high variance training data
  13. Not sure I understand their method of voting for treatment.  It looks like they base results on all possible 2-permutations of treatments (saying treatment i is as good as the average of doing i and then each other possible treatment).  Why not say i is as good as it is when followed by the best possible treatment?
    1. Perhaps it could be argued it’s intended to reduce the impact of variance, but that isn’t stated.
    2. They mention choosing to do “double bootstrapping” to reduce variance, but not as an argument for the method as a whole
  14. The method of estimating confidence intervals is different from the traditional Hoeffding bounds
    1. Use bootstrap resampling methods
  15. In all cases there is at least some overlap for the suggestion of optimal treatment
  16. Mention that SMART studies are underway for treatment of autism, substance abuse, and ADD
  17. “While this foundation is a good start, using RL methods to optimize treatment policies is not as simple as applying off-the-shelf RL methods. Though the planning horizon is very short (only 2 stages in our particular application) and the exploration policy is fixed, upon closer inspection, a number of previously unexplored challenges arise when tackling problems in this domain. In particular, our case study highlights issues such as pervasive missing data, the need to handle a high-dimensional and variable state space and the need to communicate the evidence for a recommended action and estimate confidence in the actions selected by the learned policy.”
  18. “The adaptive confidence interval method presented here assumes that the linear approximation provides a high quality approximation to the optimal Q-function. The adaptive confidence interval method has not yet been generalized for use with a non-linear approximation to the Q-function such as, for example, trees or a nearest neighbor estimators.”
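Since the horizon is only 2, the fitting can be done by backward induction: fit a linear Q-function at stage 2, then regress stage-1 targets that include the best achievable stage-2 value. This is a minimal sketch of that idea with least squares; the feature map, data layout, and names are my illustrative assumptions, and the paper’s imputation, voting, and bootstrap confidence-interval machinery are omitted entirely.

```python
import numpy as np

def fit_linear_q(phis, targets):
    """Least-squares fit of Q(x, a) = w . phi(x, a)."""
    X = np.array(phis)
    y = np.array(targets)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def two_stage_q(data, phi, actions):
    """data: list of (x1, a1, r1, x2, a2, r2) trajectories;
    phi(x, a) builds a feature vector for state x and action a."""
    # Stage 2: regress observed final rewards on stage-2 features.
    w2 = fit_linear_q([phi(x2, a2) for (_, _, _, x2, a2, _) in data],
                      [r2 for (*_, r2) in data])
    # Stage 1: targets add the best achievable stage-2 value.
    targets = [r1 + max(phi(x2, a) @ w2 for a in actions)
               for (_, _, r1, x2, _, _) in data]
    w1 = fit_linear_q([phi(x1, a1) for (x1, a1, *_) in data], targets)
    return w1, w2
```

The stage-1 target uses the max over stage-2 actions, i.e. it assumes the later decision will be made optimally, which is what makes the learned stage-1 rule part of an optimal two-stage policy.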

Notes on what type of data the algorithm requires/consumes:

  1. Very short horizon (2 step)
  2. Small action set (5 different medications).
    1. Initial action/medication used was randomized
  3. Data was recorded every three months
    1. 20 demographic variables recorded at start
    2. 30 variables measured over time
    3. Examples are:
      1. Symptom levels
      2. Quality of life
      3. Side effect burden
      4. Medication compliance
  4. They differentiate between two types of state variables: the normal ones as we think of them, and other state variables that “can aid in the accurate estimation of the Q-function, but that are not needed to inform the policy because their influence on the Q-function is the same for all actions.”
    1. They argue that including these types of state variables helps make the problem more Markovian, even though it is still highly partially observable by nature.
    2. There were 5 of these variables in this study, for example:
      1. Binary variable related to a particular side effect
      2. Whether the patient was recently hospitalized in the past 3 months
      3. Where was the patient treated if so (private, state, VA, etc)
      4. Length of time in study prior to current stage
      5. Previous treatment

Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable MDPs. John Loch, Satinder Singh.

  1. Basically uses eligibility traces to find policies in POMDPs; previous methods mainly used memory to estimate state, which gets computationally expensive
  2. Results are relevant to: “POMDPs that have good memoryless policies, i.e., on problems in which there may well be very poor observability but there also exists a mapping from the agent’s immediate observations to actions that yield near-optimal return.”
  3. Paper is almost entirely empirical results
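The core idea is plain tabular Sarsa(λ) run directly on observations rather than on (unavailable) states, so the search is over memoryless reactive policies. A sketch, with illustrative hyperparameters and an assumed `env.reset()`/`env.step()` interface:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, obs, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(obs, a)])

def sarsa_lambda(env, episodes, actions, alpha=0.1, gamma=0.95,
                 lam=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # keyed on (obs, action)
    for _ in range(episodes):
        e = defaultdict(float)                  # eligibility traces
        obs = env.reset()
        a = epsilon_greedy(Q, obs, actions, epsilon)
        done = False
        while not done:
            obs2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, obs2, actions, epsilon)
            delta = r + (0.0 if done else gamma * Q[(obs2, a2)]) - Q[(obs, a)]
            e[(obs, a)] += 1.0                  # accumulating trace
            for k in list(e):
                Q[k] += alpha * delta * e[k]
                e[k] *= gamma * lam
            obs, a = obs2, a2
    return Q
```

The traces are what make this work under aliasing: credit for a reward is spread back across the observation-action pairs that preceded it, without any attempt to disambiguate the underlying states.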

RL algorithm for POMDP Problems. Jaakkola, Singh, Jordan

  • Clearly, for POMDPs
  • Converges to local optima
  • Allows for stochastic policies (can be necessary for POMDPs)
  • Uses notation that isn’t defined anywhere, so the paper isn’t really readable.
  • Uses a discount factor that changes during the course of the calculation of the value function and converges to 1?  I’m not getting it.
  • Seems to function in a manner very similar to policy iteration, except the stochastic policy only moves ε towards the estimated optimal action at that state, instead of just locking exactly to the estimated optimal action at that state
  • I don’t really see how this approach removes the problem of non-stochastic policies breaking in POMDPs: if you let it run for enough iterations of policy improvement, it seems it will give you something very close to the pure policy that regular policy iteration would give you anyway.
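My reading of the improvement step, as a toy sketch (my own notation, not the paper’s): shift ε of probability mass toward the estimated best action at each observation, instead of committing to it outright.

```python
def improve(policy, q_values, epsilon):
    """policy: {obs: {action: prob}}, q_values: {obs: {action: value}}.
    Moves each observation's distribution a fraction epsilon toward
    the greedy action; probabilities stay normalized."""
    new_policy = {}
    for obs, probs in policy.items():
        best = max(q_values[obs], key=q_values[obs].get)
        new_policy[obs] = {
            a: (1.0 - epsilon) * p + (epsilon if a == best else 0.0)
            for a, p in probs.items()
        }
    return new_policy
```

This keeps every action at nonzero probability after any finite number of improvement steps, though as noted above, repeated application still drives the policy toward the deterministic greedy one in the limit.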