Analyzing Feature Generation fro Value-Function Approximation. Parr, Painter-Wakefield, Li, Littman. ICML 2007

  1. Deals with constructing functions for doing VFA based on adding dimensions that minimize the Bellman residual of the current representation
  2. Gives a nod to the proto-value function paper
  3. These basis functions are called Bellman Error Basis Functions (BEBFs)
  4. “… BEBFs form an orthonormal basis with guaranteed improvement in approximation quality at each iteration.”
  5. Representing the Bellman error can be just as hard as representing the value function, so this makes it seem like one hard problem is just being replaced with another one, but they show “… approximation quality can still improve even if there is significant error in the estimate of the Bellman error.”
  6. Paper deals with either the setting where the MDP is known a-priori, or when the algorithm has to run off of live samples.  Naturally the first case is easier, and error is monotonically reducing
  7. Much of the paper assumes Markov chains and not MDPs
  8. Consider case where looking for a weighing of some set of basis functions φ
    1. Matrix Φ is formed with φ spanning columns and S spanning the rows
  9. The basic idea is to add another φ to the current set of k basis functions is to set φ k+1 = TV-V

    1. In general this will be difficult to represent so some approximation must be used
    2. There is a trick to getting something close to this, is impacted by γ
  10. “Graph-based methods have the advantage of being less directly connected to the reward and value function.  The basis functions produced by these methods are derived from connectivity properties of the state space and could, potentially, be more generally applicable to a wide class of  problems with similar state space connectivity but different reward functions.  Graph-based methods have the disadvantage of being less directly connected to the reward and value functions.  The disconnect from the reward and value function makes it difficult to guarantee a specific amount of progress with each iteration.”
  11. Mentions another similar approach called matching pursuits
  12. In the first batch of results they just use the full model, which is admittedly not so useful as you can just solve the MDP for basically the same cost
  13. In the second set of results they build on LSPI
    1. LSPI normally passes on its basis functions to LSTD, which is run |A|  times.  Here, new basis functions are recomputed at each policy evaluation step in LSPI
  14. <End up needing a pretty large number of basis functions, but they probably could have gotten by with less if they accepted more error>

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: