- Deals with constructing functions for doing VFA based on adding dimensions that minimize the Bellman residual of the current representation
- Gives a nod to the proto-value function paper
- These basis functions are called Bellman Error Basis Functions (BEBFs)
- “… BEBFs form an orthonormal basis with guaranteed improvement in approximation quality at each iteration.”
- Representing the Bellman error can be just as hard as representing the value function, so this makes it seem like one hard problem is just being replaced with another one, but they show “… approximation quality can still improve even if there is significant error in the estimate of the Bellman error.”
- Paper deals with either the setting where the MDP is known a-priori, or when the algorithm has to run off of live samples. Naturally the first case is easier, and error is monotonically reducing
- Much of the paper assumes Markov chains and not MDPs
- Consider case where looking for a weighing of some set of basis functions φ
- Matrix Φ is formed with φ spanning columns and
*S*spanning the rows

- Matrix Φ is formed with φ spanning columns and
- The basic idea is to add another φ to the current set of
*k*basis functions is to set φ_{k+1}=*TV-V*

- In general this will be difficult to represent so some approximation must be used
- There is a trick to getting something close to this, is impacted by γ

- “Graph-based methods have the
*advantage*of being less directly connected to the reward and value function. The basis functions produced by these methods are derived from connectivity properties of the state space and could, potentially, be more generally applicable to a wide class of problems with similar state space connectivity but different reward functions. Graph-based methods have the*disadvantage*of being less directly connected to the reward and value functions. The disconnect from the reward and value function makes it difficult to guarantee a specific amount of progress with each iteration.” - Mentions another similar approach called
*matching pursuits* - In the first batch of results they just use the full model, which is admittedly not so useful as you can just solve the MDP for basically the same cost
- In the second set of results they build on LSPI
- LSPI normally passes on its basis functions to LSTD, which is run |
*A*| times. Here, new basis functions are recomputed at each policy evaluation step in LSPI

- LSPI normally passes on its basis functions to LSTD, which is run |
- <End up needing a pretty large number of basis functions, but they probably could have gotten by with less if they accepted more error>

Advertisements