Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Sutton, Precup, Singh. Artificial Intelligence

  1. The options paper
  2. An option is a closed-loop policy executed over some period of time. Options “enable temporally abstract knowledge and action to be included in the RL framework in a natural and general way.” They can also be interchanged with primitive actions in most algorithms, such as Q-learning.
  3. Use of options produces a semi-MDP
  4. Options may be interrupted prematurely to produce better behavior than what is possible without premature stopping
  5. Intra-option methods are introduced, which can learn about an option from fragments of its execution
  6. Use of subgoals can improve options themselves
  7. Agnostic to state abstraction, hierarchy, function approximation, or macro-utility
  8. Temporal abstraction in planning has been used in STRIPS since the early 70s
  9. Options are defined so that they can be started only from particular states (the initiation set), and they have a defined probability of terminating in each state they pass through
  10. Options are here allowed to be semi-Markovian, in that a window of execution of the option may have an influence on termination and policy, as opposed to just current state
  11. Again, the transition probabilities of options are discounted into the future.  Here it is called a multi-time model
  12. If the models are correct, convergence is to the optimal policy (not surprising).  Says that other common methods of abstraction over space (as opposed to time) may prevent development of optimal policies even if their models are correct
  13. Introduces Bellman equation and form of value iteration (called SVI) for planning with options
  14. Example of how much more quickly VI converges when given good options to work from (although given a particularly good option which causes the solution of the problem it would simply converge immediately)
    1. Visually, it is shown that as soon as VI “backs up” to a state from where useful options can be executed (as opposed to just primitives), the next step of VI produces a large change in value estimate
  15. Optimistic value initialization (optimism in the face of uncertainty – important) can cause SVI to be computationally more expensive
  16. They call the issue of what happens when too many options are available the “utility problem”, and have some citations
  17. A version of Q-learning is also defined for options
  18. Discuss the idea that it is worthwhile to let the agent cease executing an option prematurely if, somewhere along its trajectory, a state is reached that has a known better option (the option terminates when the action it would take is known to be suboptimal). Cites Kaelbling as the originator of this idea. Not surprisingly, this extension allows for better policies
  19. If this termination rule is used, then as the values converge to optimal, options may be followed for only one step and become redundant with primitives. On the other hand, they help value spread rapidly during early iterations. So they are more helpful in the beginning and less so toward the end
  20. Have a problem which is essentially a continuous path planning problem.  Actions are infinite, but there are a finite number of options that already have hard-coded policies
    1. Also talks about interruption in this case, but it's not clear how it deals with the continuous action space
  21. Method for building model of options, but options must be Markovian and not semi-Markov
    1. Naturally, he proposes TD to do this!
  22. The model building is “off-policy” in that any action selection that would be consistent with a policy (even if it came from a primitive action or other policy) is also used to update the estimates
  23. I really don't understand (or like) a number of choices in the paper
    1. Why are options and policies impure?  They don’t produce better policies/values and complicate the math
    2. Why are options semi-Markov when standard policies are not? If the domain is Markovian, there is no need for this and it just complicates things.
    3. Furthermore, the semi-Markov options are just thrown out.
    4. TD for model building is ridiculous
  24. Intra-option Q-learning (when off-policy/off-option data is used) still converges, not surprisingly
  25. End of the paper deals with terminal subgoal values
    1. These may vary to allow individual options to be more effective
    2. The implication of this and a value function reflecting this extension are provided
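
For reference, the quantities behind items 11, 13, and 17 can be written out roughly as follows. This is the usual notation from the options literature, reproduced from memory rather than copied from the paper, so check the symbols against the original. Here $\mathcal{O}_s$ (the options available in state $s$), $k$ (the number of steps the option ran), and $\alpha$ (a step size) are labels chosen for this sketch.

```latex
% Multi-time model of an option o started in state s at time t,
% terminating after k steps (k is random in general):
r_s^o = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} \right]
\qquad
p_{ss'}^o = \sum_{k=1}^{\infty} \gamma^{k} \, \Pr(\text{terminate in } s' \text{ after } k \text{ steps})

% Bellman optimality equation over options, and the value-iteration backup built on it:
V^*_{\mathcal{O}}(s) = \max_{o \in \mathcal{O}_s} \left[ r_s^o + \sum_{s'} p_{ss'}^o \, V^*_{\mathcal{O}}(s') \right]

% Q-learning over options, applied once the option terminates in s' after k steps
% with cumulative discounted reward r:
Q(s,o) \leftarrow Q(s,o) + \alpha \left[ r + \gamma^{k} \max_{o' \in \mathcal{O}_{s'}} Q(s',o') - Q(s,o) \right]
```

Note that the discount is folded into $p_{ss'}^o$, which is why the backup has no explicit $\gamma$: the "transition probabilities" of an option sum to less than one.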
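
The speed-up described in item 14 can be reproduced on a toy problem. The sketch below is my own illustration, not the paper's rooms domain: a deterministic 10-state chain with a goal at the right end, two primitive actions, and one hypothetical hand-coded option (`run-right`) that walks to the goal. All names (`Option`, `multi_time_model`, `sweeps_to_converge`) are made up for this example, and termination is restricted to be deterministic to keep the model computation short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

N = 10            # states 0..N-1; state N-1 is the terminal goal
GOAL = N - 1
GAMMA = 0.9


def step(s: int, a: int) -> Tuple[int, float]:
    """Deterministic chain: a is -1 (left) or +1 (right); reward 1 on entering the goal."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL and s != GOAL else 0.0)


@dataclass
class Option:
    name: str
    can_start: Callable[[int], bool]    # initiation set
    policy: Callable[[int], int]        # deterministic policy (an assumption of this sketch)
    terminates: Callable[[int], bool]   # termination condition, restricted to 0/1 here


def primitive(a: int, name: str) -> Option:
    """A primitive action viewed as a one-step option."""
    return Option(name, lambda s: s != GOAL, lambda s: a, lambda s: True)


def multi_time_model(o: Option, s: int) -> Tuple[float, Dict[int, float]]:
    """Roll the option out from s; return (discounted reward, discounted next-state weights)."""
    r_total, discount = 0.0, 1.0
    while True:
        s, r = step(s, o.policy(s))
        r_total += discount * r
        discount *= GAMMA
        if o.terminates(s):
            return r_total, {s: discount}


def sweeps_to_converge(options: List[Option], tol: float = 1e-9) -> Tuple[int, List[float]]:
    """Synchronous value iteration over options; returns (number of sweeps, values)."""
    models = {(o.name, s): multi_time_model(o, s)
              for o in options for s in range(N) if o.can_start(s)}
    V = [0.0] * N
    sweeps = 0
    while True:
        sweeps += 1
        new_V = list(V)
        for s in range(GOAL):  # the goal is terminal and keeps value 0
            new_V[s] = max(r + sum(p * V[s2] for s2, p in nxt.items())
                           for (_, s0), (r, nxt) in models.items() if s0 == s)
        if max(abs(x - y) for x, y in zip(V, new_V)) < tol:
            return sweeps, new_V
        V = new_V


prims = [primitive(-1, "left"), primitive(+1, "right")]
run_right = Option("run-right", lambda s: s != GOAL, lambda s: +1, lambda s: s == GOAL)

n_prim, v_prim = sweeps_to_converge(prims)
n_opt, v_opt = sweeps_to_converge(prims + [run_right])
print("sweeps with primitives only:", n_prim)
print("sweeps with primitives + option:", n_opt)
```

With primitives alone, value creeps back from the goal one state per sweep; with the option added, every state gets its final value in the first sweep and the second sweep only confirms convergence. Both runs end at the same values, which matches the note that a good enough option makes the problem converge almost immediately.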
