Evolution Strategies for Direct Policy Search. Verena Heidrich-Meisner, Christian Igel. Parallel Problem Solving from Nature, 2008

Paper discusses the use of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) for direct policy search.

This algorithm is a more general form of cross-entropy optimization (different rules for the recombination weights can be used)
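To make the weighting point concrete, here is a toy sketch of my own (not the paper's implementation): a simple (μ, λ) evolution strategy on a quadratic "policy return", where swapping the recombination weights switches between cross-entropy-style equal elite averaging and the log-rank weights CMA-ES uses. The sampler is isotropic with a hand-tuned step-size decay; full CMA-ES would also adapt a covariance matrix and step size online. All names are illustrative.

```python
import math
import random

def sphere_return(x):
    """Toy objective: negative squared distance to the optimum at 1.0."""
    return -sum((xi - 1.0) ** 2 for xi in x)

def simple_es(weights, dim=5, lam=12, sigma=0.3, iters=60, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(iters):
        # Sample lambda candidates around the current mean (isotropic here;
        # CMA-ES would additionally adapt a full covariance matrix).
        pop = [[m + sigma * rng.gauss(0, 1) for m in mean] for _ in range(lam)]
        pop.sort(key=sphere_return, reverse=True)  # selection uses ranks only
        # Weighted recombination of the mu best candidates.
        mean = [sum(w * pop[i][d] for i, w in enumerate(weights))
                for d in range(dim)]
        sigma *= 0.97  # crude geometric decay; CMA-ES adapts sigma online
    return mean

mu = 6
cem_weights = [1.0 / mu] * mu                # cross-entropy: equal elite weights
raw = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
cma_weights = [w / sum(raw) for w in raw]    # CMA-ES-style log-rank weights

for name, w in (("equal", cem_weights), ("log-rank", cma_weights)):
    m = simple_es(w)
    print(name, round(-sphere_return(m), 4))
```

Both weightings drive the mean toward the optimum on this toy problem; the only moving part is how much each ranked elite contributes to the new mean.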

Comparison is made to the Natural Actor-Critic (said to be the most sophisticated policy gradient method) and to pure random search

A couple of references put evolutionary algorithms for policy search ahead of alternative methods

A significant difficulty in comparing policy search to value-based methods is that the two are so different; selecting a particular class of function approximators (FAs) can bias the comparison

Evolutionary Function Approximation for Reinforcement Learning. Whiteson, Stone. JMLR 2006.

Says gradient-based methods generally perform poorly in the presence of noise and in domains with many poor local maxima

Also claim that because evolution-based methods only weigh candidates relative to one another, they are more resilient to noise than gradient methods:

Variable Metric Reinforcement Learning Methods Applied to the Noisy Mountain Car Problem. Same authors. EWRL 2008
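The relative-ranking argument can be illustrated with a small numerical sketch of my own construction (not from either paper): under evaluation noise, a finite-difference gradient estimate has variance that blows up as the step size shrinks, while a rank-based comparison only needs the noisy ordering of two candidates to come out right often enough.

```python
import random
import statistics

rng = random.Random(1)

def noisy_f(x, noise=0.5):
    """Toy objective with Gaussian evaluation noise; optimum at x = 1."""
    return -(x - 1.0) ** 2 + rng.gauss(0, noise)

def fd_gradient(x, h):
    """Central finite-difference gradient estimate from two noisy evals."""
    return (noisy_f(x + h) - noisy_f(x - h)) / (2 * h)

# The estimator's variance grows like noise^2 / h^2 as h shrinks.
for h in (0.5, 0.05):
    grads = [fd_gradient(0.0, h) for _ in range(2000)]
    print(f"h={h}: mean={statistics.mean(grads):.2f}, "
          f"stdev={statistics.stdev(grads):.2f}")

# Rank-based comparison: how often is the truly better candidate
# (x=0.5 is closer to the optimum than x=0.0) ranked first despite noise?
wins = sum(noisy_f(0.5) > noisy_f(0.0) for _ in range(2000))
print("better candidate ranked first:", wins / 2000)
```

The ranking stays mostly correct at a fixed noise level, whereas the gradient estimate's scatter swamps the true slope once h is small, which is one way to read the robustness claim above.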

A bunch of references for CMA-ES optimization (not RL applications)

Completely Derandomized Self-Adaptation in Evolution Strategies. Hansen, Ostermeier. Evolutionary Computation 2001

Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Hansen, Müller, Koumoutsakos. Evolutionary Computation 2003

CMA-ES was first used for RL in:

Neuroevolution for Reinforcement Learning Using Evolution Strategies. Igel. Congress on Evolutionary Computation 2003

The authors also have a couple of other papers comparing CMA-ES to policy gradient methods, in which CMA-ES proves more robust

For some settings of CMA-ES, pure random search performs better, and policy gradient performs worse still

They blame the failure of policy search on the fact that the fitness landscape has many plateaus of equally poor quality, so the search effectively performs a local random walk in policy space
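A toy construction of my own (not from the paper) shows why plateaus are so damaging: on a piecewise-constant "return", the local gradient is zero almost everywhere and gives no direction, while an ES that samples widely and keeps the best-ranked candidate can still climb between plateaus.

```python
import random

def plateau_return(x):
    """Piecewise-constant return: higher unit-width plateaus toward x >= 5."""
    return float(min(int(x), 5)) if x > 0 else 0.0

# A finite-difference gradient on the interior of a plateau is exactly zero,
# so gradient ascent cannot move.
h = 1e-3
grad = (plateau_return(2.3 + h) - plateau_return(2.3 - h)) / (2 * h)
print("local gradient on a plateau:", grad)

# A simple (1, lambda)-ES with a wide search distribution still climbs,
# because rank-based selection only needs one sample to land on a better step.
rng = random.Random(0)
x, sigma = 2.3, 1.0
for _ in range(50):
    candidates = [x + sigma * rng.gauss(0, 1) for _ in range(10)]
    x = max(candidates, key=plateau_return)  # keep the best-ranked sample
print("ES reaches x =", round(x, 2))
```

Note the flip side visible in the same sketch: once the top plateau is reached, all candidates on it tie, and the selected point drifts, i.e. exactly the local random walk the authors describe.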

Say that, despite all the good qualities of CMA-ES, policy gradient can outperform it if initialized close to the solution