Videolecture at http://techtalks.tv/talks/compositional-planning-using-optimal-option-models/57443/

- Options are closed-loop subpolicies that have some terminating condition
- An option-model describes the distribution of states upon termination of the option
- Distinguishes between
    - *intra-option* model learning: constructing option models from primitive actions
    - *inter-option* planning: using option models to construct a value function

- Option models can be composed to create more abstract option models
- In most work, the assumption is that options are created first and then a value function is computed; this paper seems to interleave the two processes at multiple levels of hierarchy
- Sutton’s earlier work somewhat dealt with this, but it was in the off-policy, or Markov chain setting
- Precup’s work provided a method to compute the value function (and therefore policies) from option models, but not to compose them into other options
- In general, the options framework is agnostic to how options are created, but some algorithms that tackle option creation explicitly are MAXQ and HAM
- These algorithms, though, are designed for the “full” learning setting and not planning (which I suppose is what is being considered here).
- May be relying on full knowledge of the dynamics instead of a generative model.

- They work with Tower of Hanoi and a hierarchical path-planning problem and say traditional methods have costs exponential in the size of the problem; but VI is polynomial in the flat state space, so my assumption is they mean exponential in a compact (as opposed to flat) representation
- Cite a paper that constructs “macro actions” and makes it poly time. Need to read.

- Looks like they are working in a setting that has reward distributions and impure policies
- Interestingly, the transition matrix as defined for options is discounted by the length of time the option takes before reaching the terminating condition. This is based on the interpretation that at each intermediate step there is a 1−γ probability of termination. I’m sure it’s needed to make the math go through
- The math itself is fairly dense linear algebra – don’t have time to grok deeply at the moment
- Some statements not quite following at the moment?
- The true value model **G** is a lower bound on the value of all policies from all states. I’m sure I’d find out why this is useful if I dug deeper. They say it basically means that policy models dominate option models.
- I suppose a policy model is the (discounted?) distribution over states from a given start state?
- It’s not clear to me how a model can dominate another model – they are both just estimates of distributions over states. These *distributions* have different values, so I suppose it’s a bit of an abuse of language?
- Oh I see, they define value models and say that policy models are value models
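The duration-discounted option model described above can be made concrete. A minimal sketch on a toy 3-state chain of my own (not from the paper): the option’s transition model accumulates a factor of γ per step until the termination condition fires, which makes the rows sum to less than one.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state chain: the option's policy moves right; state 2 absorbs.
P_pi = np.array([[0., 1., 0.],
                 [0., 0., 1.],
                 [0., 0., 1.]])
beta = np.array([0., 0., 1.])  # option terminates only in state 2

# Discounted option model satisfies
#   P_o = gamma * P_pi @ (diag(beta) + diag(1 - beta) @ P_o),
# i.e. each step either terminates (beta) or continues (1 - beta), and every
# step costs a factor gamma. Solving the linear system in closed form:
A = np.eye(3) - gamma * P_pi @ np.diag(1 - beta)
P_o = np.linalg.solve(A, gamma * P_pi @ np.diag(beta))

print(P_o[0, 2])  # gamma^2 = 0.81: two discounted steps from state 0 to termination
print(P_o[1, 2])  # gamma   = 0.9:  one discounted step from state 1
```

The row sums come out strictly below 1 (here 0.81 and 0.9), which is exactly the “probability 1−γ of termination at each step” reading of the discount mentioned above.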

- A bit tough to read on first pass because a lot of standard RL stuff is defined a bit differently
- They rewrite the Bellman equation in terms of the math they defined and work from that
- Their generalization of VI, OOMI does not impose any option hierarchy; all option models may be composed of any other option model
- Say that even if OOMI is presented only with primitive actions it may converge in “significantly fewer iterations than VI”
- Has an order of magnitude savings over two other planning methods (much more sophisticated than VI): Action-policy model iteration, and Action-**option**-policy model iteration
- In deterministic Tower of Hanoi, the number of iterations is essentially just the number of disks, whereas the other algorithms are exponential in the number of disks
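For reference, the flat baseline all of this is compared against. A minimal value-iteration sketch on a toy 4-state chain of my own devising (not from the paper): actions move left/right, with a unit reward for stepping into the absorbing goal state.

```python
import numpy as np

n, gamma = 4, 0.9
P = np.zeros((2, n, n))              # P[a, s, s'] for a in {left, right}
for s in range(n):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n - 1)] = 1.0
P[:, n - 1, :] = 0.0
P[:, n - 1, n - 1] = 1.0             # goal state is absorbing
R = np.zeros((2, n))
R[1, n - 2] = 1.0                    # reward for stepping right into the goal

V = np.zeros(n)
for _ in range(100):
    Q = R + gamma * P @ V            # Q[a, s] = R[a, s] + gamma * sum_s' P[a,s,s'] V[s']
    V_new = Q.max(axis=0)            # greedy backup over primitive actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)                             # values fall off by gamma per step from the goal
```

Each sweep only propagates value one primitive step backward, which is the behavior that composed option models are meant to short-circuit.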

- “This is the first MDP planning algorithm to dynamically create its own planning operators. These operators are composed together to give increasingly deep and purposeful jumps through state space.”

## Talk

- Talk gives the analogy of starting with bricks, composing brick laying to make a wall, and then composing walls to make buildings
- The composition of option models is just matrix multiplication
- Optimal option models (optimal at reaching some subgoal) make good macro-operators.
- Ah this notation is from Sutton
- Talk is a good quick overview of the paper
- One equation is pretty similar to the standard Bellman equation in linear-algebra form, then with an application of options, but then the equation is expressed in a different form
- This equation defines models that plan with respect to a subgoal model. This is the Bellman equation that defines the optimality of an option with respect to a provided termination condition, **G**
- Calls this the **optimal option**
- The natural question then is where do the **G**s come from?
- When planning, use provided options while composing new options from those that exist
- Seems like the initial set of subgoals and models are provided a-priori, but perhaps they can just be built up from primitives?
- Like VI, it’s an iterative procedure
- Each iteration updates all option models
- This is beneficial because as iterations improve, ability to solve all subgoals improves, and all options can simultaneously take advantage of those improvements
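The “composition is just matrix multiplication” point can be made concrete. A sketch with two hypothetical discounted option models on a 3-state space; the composition rule (transition models multiply, reward models compose as r1 + P1·r2) follows the Sutton-style option-model formalism the talk uses, but the specific matrices are made up for illustration:

```python
import numpy as np

gamma = 0.9
# Hypothetical discounted option models (rows sub-stochastic due to the discount):
# option 1 takes state 0 to subgoal state 1; option 2 takes state 1 to state 2.
P1 = gamma * np.array([[0., 1., 0.],
                       [0., 1., 0.],
                       [0., 0., 1.]])
P2 = gamma * np.array([[1., 0., 0.],
                       [0., 0., 1.],
                       [0., 0., 1.]])
r1 = np.array([0., 0., 0.])   # discounted reward models for each option
r2 = np.array([0., 1., 0.])   # (reward 1 for running option 2 from state 1)

# Composing "run option 1, then option 2" is just matrix multiplication,
# with the reward model composing as r12 = r1 + P1 @ r2:
P12 = P1 @ P2
r12 = r1 + P1 @ r2

print(P12[0, 2])   # gamma^2 = 0.81: the composed model jumps from 0 to 2
print(r12[0])      # gamma * 1 = 0.9: option 1's discount applies to option 2's reward
```

The composed model is itself a valid option model, so it can be composed again, which is the bricks-to-walls-to-buildings analogy in matrix form.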

- These model equations are contraction operators, so that’s good
- Looks like the empirical results have manually defined subgoals, which does simplify the planning problem but requires domain expertise
- Of the empirical competitors, one is a flat planner similar to VI, another composes options during an initial period and then plans with those options, and this algorithm interleaves the two processes
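The contraction claim is easy to sanity-check numerically for the linear (policy-evaluation) form of the backup: with a row-stochastic P, the discounted backup shrinks the sup-norm distance between any two value functions by at least a factor of γ. A quick sketch with random toy matrices (mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)        # random row-stochastic transition matrix
r = rng.random(n)

def backup(v):
    # Discounted policy-evaluation backup: T(v) = r + gamma * P v
    return r + gamma * P @ v

v1, v2 = rng.random(n), rng.random(n)
before = np.max(np.abs(v1 - v2))
after = np.max(np.abs(backup(v1) - backup(v2)))
print(after <= gamma * before)           # prints True: gamma-contraction in sup norm
```

The max-form (control) operator contracts by the same argument, which is what guarantees the iterative model updates converge to a unique fixed point.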