Evolution Strategies for Direct Policy Search. Verena Heidrich-Meisner, Christian Igel. Parallel Problem Solving from Nature, 2008


  1. Paper discusses the use of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) for direct policy search.
    1. CMA-ES can be seen as a generalization of cross-entropy optimization (different rules for the recombination weights can be used); see the sketch after this list.
  2. Comparison is made to Natural Actor-Critic (said to be the most sophisticated policy gradient method) and to pure random search
  3. A couple of references that put evolutionary algorithms for policy search ahead of alternative methods
    1. A significant difficulty in comparing policy search to value-based methods is that the approaches are so different; in particular, selecting particular classes of function approximators can bias the comparison
    2. Evolutionary Function Approximation for Reinforcement Learning.  Whiteson, Stone.  JMLR 2006.
  4. Says gradient-based methods generally perform poorly in the presence of noise and in domains with many poor local maxima
    1. Also claims that because evolution-based methods rely only on the relative ranking of a set of candidates, they are more resilient to noise than gradient methods (this rank-based update is what the sketch after this list illustrates):
    2. Variable Metric Reinforcement Learning Methods Applied to the Noisy Mountain Car Problem.  Same authors.  EWRL 2008
  5. A bunch of references for CMA-ES optimization (not RL applications)
    1. Completely Derandomized Self-Adaptation in Evolution Strategies.  Hansen, Ostermeier.  Evolutionary Computation 2001
    2. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES).  Hansen, Müller, Koumoutsakos.  Evolutionary Computation 2003
  6. CMA-ES was first used for RL in:
    1. Neuroevolution for Reinforcement Learning Using Evolution Strategies.  Igel.  Congress on Evolutionary Computation 2003
  7. The authors also have a couple of other papers comparing CMA-ES to policy gradient methods, in which CMA-ES is found to be more robust
  8. For some settings of CMA-ES, random search performs better, and policy gradient is worse
    1. They blame this failure on the fitness landscape having many plateaus of equally bad quality, on which the search effectively performs a local random walk in policy space
  9. Say that, despite all the good qualities of CMA-ES, policy gradient can outperform it if initialized close to the solution
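
To make items 1.1 and 4.1 concrete, below is a minimal, self-contained sketch of rank-based direct policy search in the spirit of CMA-ES. It is a deliberately simplified (mu/mu_w, lambda)-ES that adapts only the mean of an isotropic Gaussian search distribution; a real CMA-ES additionally adapts a full covariance matrix and the step size. The toy environment, the linear policy, and all hyperparameters are my own illustrative assumptions, not taken from the paper.

```python
# Simplified rank-based evolution strategy for direct policy search.
# NOT the full CMA-ES: the covariance matrix and step size are kept fixed.
# Environment, policy parameterization, and hyperparameters are illustrative.
import numpy as np

def episode_return(theta, n_steps=200, noise_std=0.1, rng=None):
    """Toy episodic task: a 1-D point mass is steered toward the origin by a
    linear policy u = theta . [position, velocity]. Returns the cumulative
    negative squared distance to the origin, corrupted by observation noise,
    so larger is better."""
    rng = rng if rng is not None else np.random.default_rng()
    pos, vel, total = 1.0, 0.0, 0.0
    for _ in range(n_steps):
        u = float(np.clip(theta @ np.array([pos, vel]), -1.0, 1.0))
        vel += 0.05 * u
        pos += 0.05 * vel
        total -= pos ** 2
    return total + noise_std * rng.normal()

def rank_weighted_es(dim=2, lam=12, mu=6, sigma=0.3, n_gens=50, seed=0):
    """(mu/mu_w, lambda)-ES with rank-based weighted recombination.
    Only the *ranking* of the lambda candidates enters the update, which is
    the property the notes cite as the source of noise robustness.
    Uniform weights over the mu best would give a cross-entropy-method-style
    update; log-rank weights (used below) are the CMA-ES default."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()
    for _ in range(n_gens):
        candidates = mean + sigma * rng.standard_normal((lam, dim))
        returns = np.array([episode_return(c, rng=rng) for c in candidates])
        elite = candidates[np.argsort(-returns)[:mu]]  # best mu, by rank only
        mean = weights @ elite                         # weighted recombination
    return mean

if __name__ == "__main__":
    theta = rank_weighted_es()
    print("final policy parameters:", theta)
    print("final (noisy) return:", episode_return(theta))
```

The point of interest is the update step: adding noise to the returns, or applying any monotone transformation to them, changes the update only if it changes the ranking of the candidates, which is the informal argument behind item 4.1.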