- The options paper: Sutton, Precup & Singh (1999), “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning”
- An option is a closed-loop policy executed over some period of time. Options “enable temporally abstract knowledge and action to be included in the RL framework in a natural and general way.” They can also be interchanged with primitive actions in most algorithms, such as Q-Learning
- Use of options (which take variable, random amounts of time to complete) turns the underlying MDP into a semi-MDP
- Options may be interrupted prematurely to produce better behavior than is possible when every option must run to its natural termination
- Intra-option learning methods are introduced that can learn about an option from fragments of its execution, without running it to termination
- Use of subgoals can improve options themselves
- The framework is agnostic to state abstraction, hierarchy, function approximation, and macro-utility
- Temporal abstraction has been used in planning since STRIPS in the early 70s
- An option may only be initiated from a designated set of states (its initiation set), and it has a defined probability of terminating at every state it may subsequently reach
- Options are here allowed to be semi-Markov, in that the whole window of execution since the option began (and not just the current state) may influence its policy and termination
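- As a minimal sketch (my own names, not the paper's notation), a Markov option ⟨I, π, β⟩ could be represented like this:

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    """A Markov option: initiation set I, internal policy pi, termination beta.

    Illustrative structure only; the paper defines options abstractly.
    """
    initiation_set: Set[State]                    # states where the option may be started
    policy: Callable[[State], Action]             # pi: picks an action while the option runs
    termination_prob: Callable[[State], float]    # beta: probability of stopping in a state

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

    def terminates(self, s: State) -> bool:
        """Sample whether the option stops in state s."""
        return random.random() < self.termination_prob(s)
```

- In this view a primitive action a is just the degenerate option that is available wherever a is, always selects a, and terminates after one step (β ≡ 1), which is why options and primitives can be mixed freely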
- Again, an option's transition probabilities are discounted by how far in the future its termination occurs; here this is called a multi-time model
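- Concretely, the multi-time model of an option o started in state s has a reward part and a (discounted) state-prediction part; as I recall, roughly:

```latex
r_s^o = \mathbb{E}\left\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid \mathcal{E}(o, s, t) \right\}

p_{ss'}^o = \sum_{k=1}^{\infty} \gamma^k \, \Pr\left\{ o \text{ terminates in } s' \text{ after } k \text{ steps} \mid \mathcal{E}(o, s, t) \right\}
```

- Here E(o, s, t) is the event that o is initiated in s at time t, and k is its (random) duration. Because of the γ^k factor, p_ss'^o sums to less than one over s', which is exactly the "multi-time" discounting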
- If the option models are correct, planning converges to the optimal policy (not surprising). The paper notes that other common methods of abstraction over space (as opposed to time) may prevent development of optimal policies even if their models are correct
- Introduces a Bellman equation and a form of value iteration (called SVI) for planning with options
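- With those models, the Bellman optimality equation restricted to a set of options O is (to my recollection):

```latex
V^{*}_{\mathcal{O}}(s) = \max_{o \in \mathcal{O}_s} \left[ r_s^o + \sum_{s'} p_{ss'}^o \, V^{*}_{\mathcal{O}}(s') \right]
```

- SVI is just value iteration with this backup applied to every state on each sweep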
- Gives an example of how much more quickly VI converges when given good options to work from (though if one of the provided options happens to solve the whole problem outright, it would of course converge essentially immediately)
- Visually, it is shown that as soon as VI “backs up” value to a state from which useful options (as opposed to just primitives) can be executed, the next sweep of VI produces a large change in the value estimate
- Optimistic value initialization (optimism in the face of uncertainty – important) can cause SVI to be computationally more expensive
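- A toy sketch of SVI over tabular option models, to make the above concrete (the model format and every name here are my own assumptions, not the paper's):

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
# One option model available in a state: (option name, r_s^o, {s': p_ss'^o})
OptionModel = Tuple[str, float, Dict[State, float]]

def svi(
    states: List[State],
    models: Dict[State, List[OptionModel]],
    init_value: float = 0.0,      # optimistic initialization (> true values) means more sweeps
    tol: float = 1e-6,
    max_sweeps: int = 1000,
) -> Dict[State, float]:
    """Synchronous value iteration over option models.

    The discount is already folded into the multi-time model p_ss'^o,
    so no explicit gamma appears in the backup.
    """
    V = {s: init_value for s in states}
    for _ in range(max_sweeps):
        new_V = dict(V)
        for s in states:
            opts = models.get(s)
            if not opts:
                continue
            new_V[s] = max(
                r + sum(p * V.get(s2, 0.0) for s2, p in trans.items())
                for _, r, trans in opts
            )
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    return V
```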
- Calls the issue of what happens when too many options are available the “utility problem”, and gives some citations
- An SMDP form of Q-Learning is also defined for options
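- As I recall, the update after executing option o from s for k steps, arriving in s' with discounted cumulative reward r = r_{t+1} + γ r_{t+2} + ... + γ^{k-1} r_{t+k}, is:

```latex
Q(s, o) \leftarrow Q(s, o) + \alpha \left[ r + \gamma^{k} \max_{o' \in \mathcal{O}_{s'}} Q(s', o') - Q(s, o) \right]
```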
- Discusses the idea that it is worthwhile to let the agent prematurely terminate an option if, somewhere along its trajectory, it reaches a state with a known better option (it interrupts when the option it is following is known to be suboptimal there). Cites Kaelbling as the originator of this idea. Not surprisingly, this extension allows for better policies
- If this interruption rule is used, then as values approach optimal, options may end up being followed for only one step at a time, which makes them redundant with primitives. On the other hand, they help value spread rapidly during early iterations, so they are more helpful in the beginning and less so toward the end
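- A hedged sketch of the interruption rule, reusing the illustrative Option class from above (env.step, q, and available_options are stand-ins I made up, not the paper's notation):

```python
def run_option_with_interruption(env, s, option, q, available_options, gamma=0.9):
    """Execute `option` from state s, but interrupt as soon as continuing it is
    known to be suboptimal, i.e. q(s, option) < max over options available in s.

    Assumes env.step(s, a) -> (next_state, reward, done); all names are illustrative.
    """
    total_reward, discount = 0.0, 1.0
    while True:
        a = option.policy(s)                    # follow the option's internal policy
        s, r, done = env.step(s, a)
        total_reward += discount * r
        discount *= gamma
        if done or option.terminates(s):        # natural termination
            break
        if q(s, option) < max(q(s, o) for o in available_options(s)):
            break                               # a better option is known here: interrupt
    return s, total_reward, discount
```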
- One task is essentially a continuous path-planning problem: the primitive action space is infinite, but there is a finite set of options with hard-coded policies
- Interruption is also discussed in this setting, but it's not clear how it deals with the continuous action space
- A method is given for building models of options, but the options must be Markov rather than semi-Markov
- Naturally, he proposes TD to do this!
- The model building is “off-policy” in that any experience consistent with an option's policy (even if the action was actually selected as a primitive or by another option's policy) is also used to update that option's model estimates
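- To the best of my recollection, the intra-option (TD) model-learning updates are applied after every transition (s_t, a_t, r_{t+1}, s_{t+1}) to every Markov option o whose policy would have chosen a_t in s_t:

```latex
R(s_t, o) \leftarrow R(s_t, o) + \alpha \Big[ r_{t+1} + \gamma \big(1 - \beta_o(s_{t+1})\big) R(s_{t+1}, o) - R(s_t, o) \Big]

P(s_t, x, o) \leftarrow P(s_t, x, o) + \alpha \Big[ \gamma \big(1 - \beta_o(s_{t+1})\big) P(s_{t+1}, x, o) + \gamma \, \beta_o(s_{t+1}) \, [s_{t+1} = x] - P(s_t, x, o) \Big] \quad \text{for all } x
```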
- I really don't understand (or like) a number of choices in the paper
- Why are options and policies impure? They don’t produce better policies/values and complicate the math
- Why are options semi-Markov when standard policies are not? If the domain is Markovian, again there is no need for this and it just complicates things.
- Furthermore, the semi-Markov options are just thrown out.
- TD for model building is ridiculous
- Intra-option Q-Learning (where off-policy/off-option data is used) still converges, not surprisingly
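- For reference, the intra-option Q-Learning update after a transition (s_t, a_t, r_{t+1}, s_{t+1}), applied to every option o consistent with a_t, is roughly:

```latex
Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \Big[ r_{t+1} + \gamma \, U(s_{t+1}, o) - Q(s_t, o) \Big],
\qquad
U(s, o) = \big(1 - \beta_o(s)\big) Q(s, o) + \beta_o(s) \max_{o' \in \mathcal{O}_s} Q(s, o')
```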
- The end of the paper deals with terminal subgoal values
- These values may be varied to allow individual options to become more effective
- The implications of this, and a value function reflecting this extension, are provided
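- As I recall, that extended value function scores an option's behavior from s by the reward it accumulates plus the discounted subgoal value g of the state where it terminates, roughly:

```latex
V_g^o(s) = \mathbb{E}\left\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} g(s_{t+k}) \mid \mathcal{E}(o, s, t) \right\}
```

- Each option's policy can then be improved against its own V_g, which is how subgoal values end up improving the options themselves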