Policy Gradients with Parameter-Based Exploration for Control. Sehnke, Osendorfer, Rückstieß, Graves, Peters, Schmidhuber.

For continuous state and action spaces and stochastic policies.

A general problem with policy gradient (PG) methods such as REINFORCE is high variance in the gradient estimates (which seems to be due partly to stochastic policies), and that variance leads to slow convergence.

In response, policy gradients with parameter-based exploration (PGPE) is proposed. It replaces the search in policy space with a search in model parameter space (isn’t this how it’s most commonly done anyway?) and uses samples from that space to estimate the “likelihood gradient wrt the parameters.” I don’t know what a likelihood gradient is, but OK – moving along.
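A note on the term: "likelihood gradient" most plausibly refers to the standard likelihood-ratio (log-derivative) trick. The sketch below assumes a Gaussian distribution over controller parameters θ with hyperparameters ρ = (μ, σ), and writes r(h) for the return of a history h. The notation is chosen here for illustration and may not match the paper's exactly.

```latex
\begin{align}
J(\rho) &= \int_{\Theta}\int_{H} p(h \mid \theta)\, p(\theta \mid \rho)\, r(h)\, dh\, d\theta \\
\nabla_{\rho} J(\rho) &= \int_{\Theta}\int_{H} p(h \mid \theta)\, p(\theta \mid \rho)\,
    \nabla_{\rho} \log p(\theta \mid \rho)\, r(h)\, dh\, d\theta \\
&\approx \frac{1}{N} \sum_{n=1}^{N} \nabla_{\rho} \log p(\theta^{n} \mid \rho)\, r(h^{n}),
    \qquad \theta^{n} \sim p(\theta \mid \rho) \\
\nabla_{\mu_i} \log p(\theta \mid \rho) &= \frac{\theta_i - \mu_i}{\sigma_i^{2}}, \qquad
\nabla_{\sigma_i} \log p(\theta \mid \rho) = \frac{(\theta_i - \mu_i)^{2} - \sigma_i^{2}}{\sigma_i^{3}}
\end{align}
```

Because p(h | θ) does not depend on ρ, only the log-density of the parameter distribution gets differentiated, never the controller itself.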

PGPE estimates the parameter gradient directly (I guess from sampling), so it can be used to train non-differentiable controllers.

PGPE attempts to fix the variance issue that arises from traditional PG methods by replacing the stochastic policy with a distribution over policy parameters θ. I don’t understand the claim they make here: “…because the actions are deterministic [why is that?] and entire history can be generated using a single sample from the parameters, thereby reducing the variance in the gradient estimate.”
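One reading of the quoted claim, sketched in code: the randomness is moved out of the action selection and into a single draw of θ per episode, after which the controller is run deterministically. The toy task, the clipped linear controller, the learning rates, and the return normalisation below are all assumptions made for illustration, not the paper's experimental setup. The updates are the Gaussian log-likelihood gradients scaled by σ², a step-size choice made here for stability.

```python
import random


def episode_return(theta, horizon=20):
    """Run a deterministic controller on a toy 1-D task and return the total reward."""
    x = 1.0
    total = 0.0
    for _ in range(horizon):
        # The action is a deterministic function of the state; the clip makes
        # the controller non-smooth in theta (hypothetical toy controller).
        a = max(-1.0, min(1.0, theta[0] * x + theta[1]))
        x += 0.1 * a
        total -= x * x  # reward penalises distance from the origin
    return total


def pgpe(iterations=100, batch=20, lr_mu=0.2, lr_sigma=0.1, seed=0):
    """Basic PGPE-style search over a Gaussian on controller parameters."""
    rng = random.Random(seed)
    mu = [0.0, 0.0]
    sigma = [1.0, 1.0]
    for _ in range(iterations):
        # One parameter sample per episode: all exploration happens here.
        thetas = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
                  for _ in range(batch)]
        returns = [episode_return(th) for th in thetas]
        baseline = sum(returns) / batch
        spread = (sum((r - baseline) ** 2 for r in returns) / batch) ** 0.5
        # Baseline-subtracted returns, normalised by their spread
        # (the normalisation is an assumption, added to keep steps bounded).
        adv = [(r - baseline) / (spread + 1e-8) for r in returns]
        for i in range(len(mu)):
            diffs = [th[i] - mu[i] for th in thetas]
            # sigma^2 * grad_mu log p = (theta - mu)
            g_mu = sum(a * d for a, d in zip(adv, diffs)) / batch
            # sigma^2 * grad_sigma log p = ((theta - mu)^2 - sigma^2) / sigma
            g_sigma = sum(a * (d * d - sigma[i] ** 2) / sigma[i]
                          for a, d in zip(adv, diffs)) / batch
            mu[i] += lr_mu * g_mu
            sigma[i] = max(1e-3, sigma[i] + lr_sigma * g_sigma)
    return mu, sigma
```

Calling `mu, sigma = pgpe()` and then `episode_return(mu)` gives the return of the learned mean controller, which can be compared with `episode_return([0.0, 0.0])` for the untrained one. Note that `episode_return` is only ever called, never differentiated, which is why the same recipe would apply to a non-differentiable controller.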

They mention that parameter space is generally larger than action space, so searching it has higher sample complexity.

They do well on some complex domains though (12 action dimensions, 32 state dimensions), but comparing against REINFORCE isn’t very reassuring – that’s like comparing against Q-learning.

They discuss the relationship to REINFORCE, evolution strategies (ES), and simultaneous perturbation stochastic approximation (SPSA), which I know nothing about, so I’m skipping that part.