- Shows that planning with linear expectation of model dynamics converges to a fixed point with low TD error
- Highly skeptical of the ability of this to work in practice without serious feature engineering (arriving at a fixed point does not mean low error)
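The fixed-point claim can be seen in a toy sketch (entirely my own construction, not the paper's setup): with a linear expectation model — an expected-feature transition matrix `F` and reward weights `b` — iterating the linear Bellman backup converges to a unique fixed point whenever the model is a contraction, regardless of how accurate that fixed point actually is.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # toy number of features (my choice)

# Linear expectation model: E[phi'] = F @ phi, E[r] = b @ phi
F = rng.random((n, n))
F /= F.sum(axis=0) * 1.1   # scale columns so gamma * ||F|| < 1 (contraction)
b = rng.random(n)
gamma = 0.9

# Planning = iterating the linear Bellman backup on value weights theta
theta = np.zeros(n)
for _ in range(1000):
    theta = b + gamma * F.T @ theta

# Closed-form fixed point of the same backup, for comparison
theta_star = np.linalg.solve(np.eye(n) - gamma * F.T, b)
```

Note the convergence argument says nothing about whether `theta_star` is close to the true value function — that depends on how well `F` and `b` model the environment, which is the feature-engineering worry above.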

- “…building on recent work on linear feature selection, we show conditions under which a linear feature set is sufficient for accurately representing the value of an option policy. We extend this result to show conditions under which multiple options may be repeatedly composed to create new options with accurate linear models.”
- Also deals with state abstraction
- MAXQ also does state and temporal abstraction; options in continuous spaces naturally do so as well
- “However, the ability to *plan* using a temporally abstract model - one of the central advantages of temporal abstraction - remains a challenging problem when the number of states is large.” (This is only partially true. Planning with options is almost identical to planning without them, and in many cases options make planning much easier. The difficulties of dealing with large state spaces are orthogonal to options, but fit into state abstraction, which they do mention earlier.)
- Define a semi-MDP version of LSTD.
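An SMDP version of LSTD plausibly looks like ordinary LSTD with the discount raised to the option's duration (a sketch under that assumption — the function name and sample format are mine, not the paper's):

```python
import numpy as np

def smdp_lstd(transitions, n_features, gamma=0.9, reg=1e-6):
    """Least-squares TD for an SMDP. Each sample is (phi, R, tau, phi_next):
    phi is the start-state feature vector, R the discounted return accumulated
    while the option ran, tau its duration, phi_next the termination features.
    (Sketch; signature and names are my own.)"""
    A = reg * np.eye(n_features)  # small ridge term for invertibility
    b = np.zeros(n_features)
    for phi, R, tau, phi_next in transitions:
        # Successor features are discounted by gamma**tau instead of gamma
        A += np.outer(phi, phi - gamma**tau * phi_next)
        b += R * phi
    return np.linalg.solve(A, b)

# Tiny example: two one-hot states; an option lasting 2 steps earns return 1
phi0, phi1 = np.eye(2)
data = [(phi0, 1.0, 2, phi1), (phi1, 0.0, 1, phi1)]
theta = smdp_lstd(data, 2)
```

With `tau = 1` everywhere this reduces to standard LSTD, which is the sanity check I'd apply to whatever the paper's actual formulation is.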
- Has an algorithm, LOEM, for the policy update? Converges to the same solution as SMDP-LSTD
- I think the policy updates are separate from the value estimates in this algorithm
- They also test a DYNAish version of this
- Does intra-option learning (updates values of actions performed while the option is executing, not just atomically at the end)
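The intra-option idea — every option consistent with the observed action gets updated at every step, not just the executing option at termination — can be sketched as an intra-option Q-learning step (helper names and the toy setup are mine):

```python
import numpy as np

def intra_option_update(Q, options, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """One intra-option Q-learning step. `Q` is indexed [state, option];
    `options` maps option id -> (policy, beta), where policy(s) gives the
    option's action and beta(s) its termination probability. (Sketch.)"""
    for o, (policy, beta) in options.items():
        if policy(s) != a:
            continue  # option's policy disagrees with the observed action
        # Continue o with prob 1-beta; otherwise terminate and act greedily
        u = (1 - beta(s_next)) * Q[s_next, o] + beta(s_next) * Q[s_next].max()
        Q[s, o] += alpha * (r + gamma * u - Q[s, o])
    return Q

# Two states, two options; only option 0's policy matches the taken action
Q = np.zeros((2, 2))
options = {
    0: (lambda s: 0, lambda s: 1.0),  # takes action 0, always terminates
    1: (lambda s: 1, lambda s: 0.0),  # takes action 1, never terminates
}
intra_option_update(Q, options, s=0, a=0, r=1.0, s_next=1)
```

Only `Q[0, 0]` moves here, which is the point: the single observed transition trains every consistent option simultaneously.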
- Also discuss ability to learn policies using pseudoreward as opposed to true reward
- Oh, then it's extended to option models
- There is an error bound that is the product of 2 errors (looks like actually 3: the return of the option, and then a term that combines option duration and outcome distribution)
- “However, these conditions apply only wrt a given behavioral policy. A learning/planning agent will not know a priori what the optimal policy is, and it is ultimately the value of the optimal policy that is desired.”
- This is the problem with most proofs related to linear RL: they have to fit all policies encountered during learning, and each time data is gained the policy changes (if directed exploration is used, this is even more of a problem). On the other hand, if you have too many features, the methods also suffer. It's very tough to find a feature set that is flexible enough to represent the policies encountered during learning, as well as a near-optimal policy, without overfitting due to excess features

- They do try to define compositionality to help with this, but I'm not sure how far it really goes toward solving the problem
- Accuracy guarantees that hold for individual options are maintained when those that satisfy conditions are composed linearly
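The composition claim is easy to illustrate for linear option models. Assuming a model of the usual form — an expected (discounted) feature-transition matrix `F` and return weights `b`, so running the option from features `phi` yields expected return `b @ phi` and expected termination features `F @ phi` — "run o1, then o2" is again a linear model (a sketch under that assumption; the variable names are mine):

```python
import numpy as np

def compose(F1, b1, F2, b2):
    """Linear model of running option 1 to termination, then option 2."""
    F = F2 @ F1            # features after both options, in sequence
    b = b1 + F1.T @ b2     # o1's return plus o2's return from where o1 lands
    return F, b

rng = np.random.default_rng(1)
n = 4
F1, F2 = rng.random((n, n)) * 0.2, rng.random((n, n)) * 0.2
b1, b2 = rng.random(n), rng.random(n)
F12, b12 = compose(F1, b1, F2, b2)

# Sanity check: the composed model matches applying the two models in turn
phi = rng.random(n)
assert np.isclose(b12 @ phi, b1 @ phi + b2 @ (F1 @ phi))
assert np.allclose(F12 @ phi, F2 @ (F1 @ phi))
```

Since the composed model has the same linear form, it can be composed again, which is what lets per-option accuracy guarantees carry over to the repeated compositions the bullet describes.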
- Tested in a continuous navigation task (12 rooms) – needs 3472 features
- Tested linear MF and linear MB agents
- Testing had separate exploration and exploitation phases
- Not grokking what options were learned/provided, but it looks like they were developed based only on the transition dynamics
- Abstracting over time and space yields best rewards