Actor-Critic Reinforcement Learning with Energy-Based Policies. Heess, Silver, Teh. JMLR W&C EWRL 2012

  1. For high-dimensional RL
  2. Policy gradient with restricted Boltzmann machines (RBMs)
  3. Builds on Sallans and Hinton, but the trick here is not to use Boltzmann machines for value function approximation (VFA), as that approach diverges
  4. Actor-critic
  5. Converges to local optimum
  6. “We consider a class of policies based on energy-based models [LeCun et al., 2006], where the (negative) log probability of selecting an action is proportional to an energy function… They can learn deep, distributed representations of high-dimensional data (such as images) and model high-order dependencies, and here we will use them to directly parameterize representations over states and actions.”
  7. RBMs aren’t just for RL – they do supervised learning as well
  8. Energy-based policies for RL are difficult to use because:
    1. The optimization landscape is nonconvex
    2. “… it is often intractable to compute the partition function that normalizes the energy function into a probability distribution”
    3. Evaluating a policy has high variance
  9. Uses “… the natural gradient, which reduces the dependence of the performance of the policy gradient on the parameterization […]; and by using an actor-critic architecture, which uses an approximate value function to reduce the variance in the gradient estimates […].”
  10. Two actor-critic algorithms are introduced
  11. “The direction of the vanilla policy gradient is sensitive to reparameterizations of the policy that don’t affect the action probabilities.  Natural policy gradient algorithms […] remove this dependence, by finding the direction that improves J(Θ) the most for a fixed small amount of change in distribution (e.g. measured by K.L. divergence).”
  12. The natural TD actor-critic algorithm (earlier work by other authors) has guarantees that “Provided that the average reward, TD error and critic are updated on slower time scale than for the actor, and step sizes are reduced at appropriate rates, NTDAC converges to a policy achieving near-local maximum average reward […].”
  13. <Props for doing their experiments right> “In a cross-validation style setup, we first choose the parameters based on a preliminary parameter sweep, then fix the parameters and perform a final run for which we report the results.”
  14. The two algorithms they introduce here significantly outperform competitors
  15. <Props on doing OCTOPUS ARM!  Although they just have actions as binary values on 4 components, so it's still pretty drastically simplified; it seems they use this representation for the shape too, so it's not really enormous>
    1. Also shaping is included
    2. They actually lose out slightly to neural networks here, although they catch up by the end
    3. “Some ENATDAC runs were trapped in suboptimal maxima where the arm moves only in one direction, thus taking a longer time to reach the target on some episodes…”
