- When you don’t have a potential function defined a priori, is it possible to learn one while interacting with the environment and still do better than learning without shaping?
- Two cases considered: model-free (SARSA) as well as model-based (R-Max)
- For the model-free case, the shaping function is learned on a lower-resolution grid, which can be learned more rapidly than the full-resolution task
- For model based, use “free space assumption” <not sure what that means – Manhattan distance for navigation tasks?>
- In both cases, they use an abstracted task to do the shaping

- Extends Grzes, M., & Kudenko, D. (2008a). Multigrid reinforcement learning with reward shaping. In LNCS, Proceedings of the 18th international conference on artificial neural networks
- Shaping can be useful because VFAs can be used for the shaping, and even if this is unsafe, it will still not influence the final policy
- Ng’s work on shaping was for model-free methods, though John’s extended it to model-based, R-Max-like settings
- John’s showed that if the shaping function is admissible, the algorithm is PAC-MDP
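
For reference, the potential-based form from Ng’s shaping work that these notes rely on can be written as below. Here Φ is the potential function over states and γ is the discount factor; the symbol r' for the shaped reward is notation picked for this note, not taken from the paper.

```latex
F(s, a, s') = \gamma \, \Phi(s') - \Phi(s), \qquad r'(s, a, s') = r(s, a, s') + F(s, a, s')
```

Because F is a difference of potentials, its discounted sum along any trajectory telescopes, which is why a poor (but fixed) Φ changes how fast learning goes without changing which policy is optimal.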

- The “free space assumption” is used for real-time learning of heuristics (à la LRTA*)
- “The free space assumption deals with initial uncertainty assuming that all actions in all states are unblocked.”

- “In the automatic shaping approach (Marthi, 2007) an abstract MDP is formulated and solved. In the initial phase of learning, the abstract MDP model is built and, after a defined number of episodes, the abstract MDP is solved exactly and its value function used as the potential function for ground states. In this paper, we propose an algorithm which applies a multigrid strategy…”
- The method used here (for the model-free part), though, is model-free, doesn’t increase computational costs, and requires only minimal domain knowledge <knowledge of how to aggregate states is needed, although they may just glob together states based on the transition function>
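
A minimal sketch of the model-free idea, under assumptions of my own: states are `(x, y)` grid cells, `coarse` aggregates them into 4x4 blocks, and the coarse value function is learned with plain TD(0). The names `coarse`, `LearnedPotential` and `sarsa_update` are hypothetical, and the paper's actual update rules may differ in detail.

```python
from collections import defaultdict

GAMMA = 0.99  # discount factor (assumed value)


def coarse(state, cell=4):
    """Map a fine grid state (x, y) to the coarse cell that contains it."""
    x, y = state
    return (x // cell, y // cell)


class LearnedPotential:
    """Value function over coarse cells, used as the potential Phi."""

    def __init__(self, alpha=0.1, gamma=GAMMA):
        self.v = defaultdict(float)
        self.alpha = alpha
        self.gamma = gamma

    def update(self, s, r, s_next, done):
        # TD(0) update on the abstract (coarse) states.
        z, z_next = coarse(s), coarse(s_next)
        target = r + (0.0 if done else self.gamma * self.v[z_next])
        self.v[z] += self.alpha * (target - self.v[z])

    def shaping(self, s, s_next, done):
        # F = gamma * Phi(s') - Phi(s), with Phi = 0 at terminal states.
        phi_next = 0.0 if done else self.v[coarse(s_next)]
        return self.gamma * phi_next - self.v[coarse(s)]


def sarsa_update(q, potential, s, a, r, s_next, a_next, done,
                 alpha=0.1, gamma=GAMMA):
    """One SARSA step on the fine grid, with the learned shaping reward added."""
    potential.update(s, r, s_next, done)
    f = potential.shaping(s, s_next, done)
    target = r + f + (0.0 if done else gamma * q[(s_next, a_next)])
    q[(s, a)] += alpha * (target - q[(s, a)])
```

The coarse table has far fewer entries than the fine one, so it fills in quickly and starts steering the fine-grained learner early. Note that the potential here changes while learning, so the fixed-potential invariance argument applies only loosely to this sketch.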

- Here, the shaping function is simply a value function, but one where each state is mapped to a set of states and the value is learned for the entire set (generalization to improve learning rate)
- On to the model-based part
- In effect, R-Max has a heuristic that each state can lead directly to the goal state. Instead, 1/(1-γ) can be replaced by a tighter (although still admissible) heuristic value. This still maintains guarantees of PAC-MDPness
- This is what John’s paper is about
- The algorithm is PAC-MDP iff heuristic is admissible
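
A rough sketch of the model-based idea, with assumptions made up for illustration: a grid navigation task with reward 1 on entering a terminal goal and 0 elsewhere, so that ignoring walls (Manhattan distance) gives an upper bound on the true value. The names `free_space_heuristic` and `optimistic_q`, and the visit threshold `m`, are hypothetical and not from either paper.

```python
GAMMA = 0.95
V_MAX = 1.0 / (1.0 - GAMMA)  # the usual optimistic value for rewards in [0, 1]


def manhattan(s, goal):
    """Distance to the goal if every action in every state were unblocked."""
    return abs(s[0] - goal[0]) + abs(s[1] - goal[1])


def free_space_heuristic(s, goal, gamma=GAMMA):
    """Optimistic value of s: reward 1 collected after the fewest possible steps.

    Walls can only lengthen the real path, so the true optimal value
    is never above this bound (under the reward assumption stated above).
    """
    d = manhattan(s, goal)
    if d == 0:
        return 0.0  # the goal is terminal, nothing left to collect
    return gamma ** (d - 1)


def optimistic_q(counts, q_model, s, a, goal, m=5):
    """Q-value used for planning in an R-Max-style learner.

    Pairs tried fewer than m times are "unknown" and get the heuristic
    in place of V_MAX; known pairs use the value from the learned model.
    """
    if counts.get((s, a), 0) < m:
        return free_space_heuristic(s, goal)
    return q_model[(s, a)]
```

The heuristic stays optimistic but is much tighter than `V_MAX` for states far from the goal, so the agent wastes less exploration on state-action pairs that could never pay off.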

- The free space assumption says that if actions may fail in the actual domain, ignore that possibility (for example, ignore walls in a navigation task)
- <I don’t really see how the model-based part of this paper contributes anything significant beyond John’s paper>
- Good source of references
