Policy Gradient Methods for Reinforcement Learning with Function Approximation. Sutton, McAllester, Singh, Mansour

They argue that the problem with trying to learn a Q-function is that it only yields deterministic policies, and that small changes in the Q-function can lead to large changes in the policy, so we should use something that doesn’t have those properties.

Well, in a regular MDP stochastic policies don’t help (there is always a deterministic optimal policy), and presumably, if the Q-values of two actions are actually very close, it doesn’t really matter very much which is chosen, so these aren’t wonderful reasons to avoid value-based methods.

The method represents the policy via a parameter vector Θ given to some FA. For example, if the policy is linear in the feature vector, the parameters are the component weights.
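As a concrete sketch of that parameterization (my own illustration, not the paper’s code): action preferences linear in hypothetical features phi(s, a), pushed through a softmax so the policy stays stochastic and differentiable in Θ.

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """Action probabilities for one state.

    theta:  (d,) parameter vector
    phi_sa: (num_actions, d) feature matrix, one row per action
            (the feature function is an assumption for illustration)
    """
    prefs = phi_sa @ theta          # linear action preferences
    prefs -= prefs.max()            # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# With theta = 0, all preferences tie, so the policy is uniform.
theta = np.zeros(3)
phi_sa = np.eye(3)                  # toy one-hot features per action
print(softmax_policy(theta, phi_sa))
```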

Claim convergence to a local optimum, and that small changes in Θ produce correspondingly small changes in the policy.

This is a little misleading because at best it converges to a local optimum over the policies the FA’s hypothesis space can represent, which can be quite limited. Convergence must also depend on things like step size, etc.

Williams invented the first real policy gradient method (REINFORCE), but it learned more slowly than algorithms based on the value function (and probably crappy model-free ones, at that), so it wasn’t really popular.

The algorithm proposed is a generalization of REINFORCE
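A minimal sketch of plain REINFORCE, on a toy two-armed bandit of my own choosing (the softmax parameterization, reward setup, and learning rate are all illustrative assumptions, not the paper’s experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_bandit(steps=2000, alpha=0.1):
    """REINFORCE on a one-state problem: action 1 pays reward 1, action 0 pays 0."""
    theta = np.zeros(2)                  # one softmax preference per action
    for _ in range(steps):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        a = rng.choice(2, p=probs)       # sample an action from the policy
        r = float(a == 1)                # reward for the sampled action
        # grad of log pi(a) for a softmax is one_hot(a) - probs
        grad = -probs
        grad[a] += 1.0
        theta += alpha * r * grad        # REINFORCE: step along r * grad log pi
    return theta

theta = reinforce_bandit()
print(theta)                             # preference for action 1 should dominate
```

Since only the rewarded action generates a nonzero update here, the preference for action 1 grows and the policy concentrates on it.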
