Policy Gradient Methods for RL with Function Approximation. Sutton, McAllester, Singh, Mansour

  • Argue that the problem with trying to learn a Q-function is that it yields only deterministic policies, and that small changes in the estimated Q-values can cause large (discontinuous) changes in the greedy policy, so we should use a method that doesn’t have those properties.
    • Well, in a regular MDP an optimal deterministic policy always exists, so stochastic policies don’t help, and presumably, if the Q-values of two actions are actually very close, it doesn’t matter much which one is chosen, so these aren’t wonderful reasons to avoid value-based methods
  • The method represents the policy directly in terms of a parameter vector Θ given to some FA.  For example, if the policy is linear in the feature vector, the parameters are the component weights
  • Claim is that the method converges to a local optimum, and that small changes in Θ lead to only small changes in the policy.
    • This is a little misleading, because at best it converges to a local optimum over the policies the FA can represent, which can be quite limited.  Convergence must also depend on things like step size, etc.
  • Williams invented the first real policy gradient method, but it learned slower than algorithms based on the value function (probably even crappy model-free ones, at that), so it wasn’t really popular
  • The algorithm proposed is a generalization of REINFORCE
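To make the REINFORCE idea above concrete, here is a minimal sketch of the basic update θ ← θ + α·r·∇log π(a; θ) on a toy two-armed bandit, assuming a softmax policy linear in one-hot action features (the bandit, its payoffs, and all variable names are my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit: arm 1 pays more on average (my own example).
true_means = np.array([0.2, 0.8])

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Policy linear in one-hot action features: theta[a] is the score of arm a.
theta = np.zeros(2)
alpha = 0.1                  # step size

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    reward = rng.normal(true_means[a], 0.1)
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # REINFORCE update: follow the reward-weighted score function.
    theta += alpha * reward * grad_log_pi

print(softmax(theta))  # most of the probability mass should end up on arm 1
```

This is baseline-free REINFORCE; the paper’s contribution is showing the same gradient form remains valid when a compatible function approximator replaces the raw reward signal.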
