Policy Gradient Methods for Reinforcement Learning with Function Approximation. Sutton, McAllester, Singh, Mansour

They argue that the problem with trying to learn a Q-function is that it only yields deterministic policies, and that small changes in the Q-function can lead to large changes in the policy, so we should use something that doesn’t have those properties.

Well, in a regular MDP stochastic policies don’t help (there is always a deterministic optimal policy), and presumably, if the Q-values of two actions are actually very close, it doesn’t really matter very much which is chosen, so these aren’t wonderful reasons to avoid value-based methods.

The method represents the policy via a parameter vector Θ given to some FA. For example, if the policy is linear in the feature vector, the parameters are the component weights.
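As a concrete sketch of that parameterization (my own illustration, not the paper’s code): action preferences linear in hypothetical features phi(s, a), pushed through a softmax so the policy stays stochastic and differentiable in Θ.

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """Action probabilities for one state.

    theta:  (d,) parameter vector
    phi_sa: (num_actions, d) feature matrix, one row per action
            (the feature function is an assumption for illustration)
    """
    prefs = phi_sa @ theta          # linear action preferences
    prefs -= prefs.max()            # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# With theta = 0, all preferences tie, so the policy is uniform.
theta = np.zeros(3)
phi_sa = np.eye(3)                  # toy one-hot features per action
print(softmax_policy(theta, phi_sa))
```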

Claim convergence to a local optimum, and that small changes in Θ produce correspondingly small changes in the policy.

This is a little misleading because at best it converges to a local optimum over the policies the FA’s hypothesis space can represent, which can be quite limited. Convergence must also depend on things like step size, etc.

Williams invented the first real policy gradient method (REINFORCE), but it learned more slowly than algorithms based on the value function (and probably crappy model-free ones, at that), so it wasn’t really popular.

The algorithm proposed is a generalization of REINFORCE
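A minimal sketch of plain REINFORCE, on a toy two-armed bandit of my own choosing (the softmax parameterization, reward setup, and learning rate are all illustrative assumptions, not the paper’s experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_bandit(steps=2000, alpha=0.1):
    """REINFORCE on a one-state problem: action 1 pays reward 1, action 0 pays 0."""
    theta = np.zeros(2)                  # one softmax preference per action
    for _ in range(steps):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        a = rng.choice(2, p=probs)       # sample an action from the policy
        r = float(a == 1)                # reward for the sampled action
        # grad of log pi(a) for a softmax is one_hot(a) - probs
        grad = -probs
        grad[a] += 1.0
        theta += alpha * r * grad        # REINFORCE: step along r * grad log pi
    return theta

theta = reinforce_bandit()
print(theta)                             # preference for action 1 should dominate
```

Since only the rewarded action generates a nonzero update here, the preference for action 1 grows and the policy concentrates on it.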
