Linear Options. Sorg, Singh. AAMAS 2010.

  1. Shows that planning with linear expectation of model dynamics converges to a fixed point with low TD error
    1. Highly skeptical of the ability of this to work in practice without serious feature engineering (arriving at a fixed point does not mean low error)
  2. “…building on recent work on linear feature selection, we show conditions under which a linear feature set is sufficient for accurately representing the value of an option policy.  We extend this result to show conditions under which multiple options may be repeatedly composed to create new options with accurate linear models.”
  3. Also deals with state abstraction
  4. MAXQ also does state and temporal abstraction; options in continuous spaces naturally do so as well
  5. “However, the ability to plan using a temporally abstract model-one of the central advantages of temporal abstraction-remains a challenging problem when the number of states is large.” (This is only partially true.  Planning with options is almost identical to planning without them and in many cases makes planning much easier.  The difficulties of dealing with large state spaces is orthogonal to options, but fits into state abstraction which they do mention earlier).
  6. Define a semi-MDP version of LSTD.
  7. Has an algorithm LOEM for policy update?  Converges to same as SMDP-LSTD
  8. I think the policy updates are separate from the value estimates in this algorithm
  9. They also test a DYNAish version of this
  10. Does intra-option learning (updates values of actions performed while option is executing, and not just atomically at the end)
  11. Also discuss ability to learn policies using pseudoreward as opposed to true reward
  12. Oh then its extended to option models
  13. There is an error bound that is the product of 2 errors (looks like actually 3, the return of the option and then a term that combines option duration and outcome distribution)
    1. “However, these conditions apply only wrt a given behavioral policy.  A learning/planning agent will not know a priori what the optimal policy is, and it is ultimately the value of the optimal policy that is desired.”
    2. This is the problem with most proofs related to linear RL – they have to fit all policies that are encountered during learning; each time data is gained the policy changes (if directed exploration is used this is even more of a problem).  On the other hand, if you have too many features, the methods also suffer.  Its very tough to find the set of features that is flexible enough to represent policies during learning, as well as a near-optimal policy, without overfitting due to excess features
  14. They do try to define compositionality to help this, but I’m not sure how much it really goes to solving the problem
  15. Accuracy guarantees that hold for individual options are maintained when those that satisfy conditions are composed linearly
  16. Tested in a continuous navigation task (12 rooms) – needs 3472 features
  17. Tested linear MF and linear MB agents
  18. Testing had separate exploration and exploitation phases
  19. Not groking what options were learned/provided, but they were developed only based on the transition dynamics it looks like
  20. Abstracting over time and space yields best rewards

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: