- This looks right up my alley. Discusses policy search methods. Compares PI^2 to Cross-Entropy and CMAES.
- Then proposes a new algorithm PI^2-CMA, whose main advantage is that it determines magnitude of exploration noise automatically
- 2 continuous RL papers they cite are:
- Policy Search for Motor Primitives in Robotics. Kober, Peters. Machine Learning 2011.
- A Generalized Path Integral Control Approach to Reinforcement Learning. Theodorou, Buchli, Schaal. JMLR 2010.

- Says the best algorithm now is Policy Improvement with Path Integrals (PI^2)
- The claim is that it outperforms REINFORCE and Natural Actor Critic by an order of magnitude in terms of speed and quality
- But at least REINFORCE is total garbage so that is not an impressive claim to be able to make

- PI^2 is different from policy gradient methods in that it uses probability-weighted averaging to do the parameter update instead of an estimate of the gradient.
- CMAES (Covariance Matrix Adaptation Evolution Strategy), as well as CEM (the Cross-Entropy Method), also use this basic idea
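
A minimal sketch of what probability-weighted averaging looks like, to make the idea concrete. The exponentiated, min-max normalized cost follows the usual PI^2 form; the function names and the default `h=10.0` are illustrative assumptions, not taken from the paper.

```python
import math

def softmax_weights(costs, h=10.0):
    """Map costs to probabilities: lower cost -> higher weight.

    h is an assumed 'eliteness' parameter; costs are min-max
    normalized to [0, 1] before exponentiation.
    """
    lo, hi = min(costs), max(costs)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all costs are equal
    raw = [math.exp(-h * (c - lo) / span) for c in costs]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_mean(samples, weights):
    """New parameter vector = probability-weighted average of the samples."""
    dim = len(samples[0])
    return [sum(w * s[i] for w, s in zip(weights, samples)) for i in range(dim)]
```

No gradient is estimated anywhere: the update only needs the sampled parameter vectors and their costs.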

- Although all 3 algorithms have same basic component, each arrived at the rule from different principles
- By reinterpreting CEM as performing probability-weighted averaging, it can be shown that CEM is a special case of CMAES
- For the Cross-Entropy method, a particular form of probability-weighted averaging is performed, where the “good” samples get probability 1/*k* (if there are *k* of them), and the “bad” samples get probability 0.
- CEM for policy optimization was introduced in:
- The Cross-Entropy Method for Fast Policy Search. Mannor, Rubinstein, Gat. ICML 2003. This was primarily for finite MDPs; they mention the extension to continuous spaces
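
The 1/*k* weighting described above can be sketched as a single CEM update. This is a toy version under stated assumptions: a diagonal Gaussian instead of a full covariance matrix, and hypothetical names (`cem_weights`, `cem_step`) and defaults (`n_samples=20`, `k=5`) chosen only for illustration.

```python
import random

def cem_weights(costs, k):
    """Give the k lowest-cost samples weight 1/k and all others weight 0."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    elite = set(order[:k])
    return [1.0 / k if i in elite else 0.0 for i in range(len(costs))]

def cem_step(mean, var, cost_fn, n_samples=20, k=5, rng=random):
    """One CEM update with a diagonal Gaussian (an assumption for brevity)."""
    dim = len(mean)
    # Sample parameter vectors around the current mean.
    samples = [[rng.gauss(mean[d], var[d] ** 0.5) for d in range(dim)]
               for _ in range(n_samples)]
    w = cem_weights([cost_fn(s) for s in samples], k)
    # Weighted average of samples; with 1/k weights this is the elite mean.
    new_mean = [sum(wi * s[d] for wi, s in zip(w, samples)) for d in range(dim)]
    # Weighted variance of the elite samples around the new mean.
    new_var = [sum(wi * (s[d] - new_mean[d]) ** 2 for wi, s in zip(w, samples))
               for d in range(dim)]
    return new_mean, new_var
```

Swapping `cem_weights` for a smoother weighting is the only change needed to move from the "elite or not" rule toward the weighting used by the other algorithms.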

- Cross-Entropy in continuous spaces was more thoroughly introduced in:
- Cross-Entropy Optimization of Control Policies with Adaptive Basis Functions. Busoniu, Ernst, De Schutter, Babuska. IEEE Transactions on Systems, Man, and Cybernetics. 2011.

- CEM has also been used with sampling-based motion planning:
- Learning Cost-efficient Control Policies with XCSF: Generalization capabilities and Further Improvement. Marin, Decock, Rigoux, Sigaud. Genetic and Evolutionary Computation

- CMAES is very similar to CEM, except it uses a more sophisticated method to update the covariance matrix
- There is also an extra variable that scales the covariance matrix and therefore controls exploration
- It also keeps track of the history of parameter changes and uses this “evolution path” to help speed convergence based on correlations between generations
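
A rough sketch of how the scaling variable and the evolution path interact, following the standard cumulative step-size adaptation rule from the CMAES literature. It assumes an identity covariance matrix (so the usual C^(-1/2) factor drops out), and the constants `c_sigma=0.3` and `d_sigma=1.0` are illustrative placeholders rather than recommended settings.

```python
import math

def expected_norm(n):
    """Approximate expected length of an n-dimensional standard normal vector."""
    return math.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n * n))

def update_step_size(sigma, path, old_mean, new_mean, mu_eff,
                     c_sigma=0.3, d_sigma=1.0):
    """Update the evolution path, then scale sigma up or down from its length.

    Simplification: the covariance matrix is assumed to be the identity.
    """
    n = len(old_mean)
    gain = math.sqrt(c_sigma * (2.0 - c_sigma) * mu_eff)
    # Exponentially fading memory of the (normalized) mean shifts.
    path = [(1.0 - c_sigma) * p + gain * (m1 - m0) / sigma
            for p, m0, m1 in zip(path, old_mean, new_mean)]
    norm = math.sqrt(sum(p * p for p in path))
    # Path longer than expected under random steps -> grow sigma; shorter -> shrink.
    sigma = sigma * math.exp((c_sigma / d_sigma) * (norm / expected_norm(n) - 1.0))
    return sigma, path
```

Consecutive mean shifts in the same direction lengthen the path and increase exploration; shifts that cancel out shorten it and decrease exploration.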

- For a description of CMAES, see:
- Completely Derandomized Self-adaptation in Evolution Strategies. Hansen, Ostermeier. Evolutionary Computation 2001.

- For example of CMAES applied to double pole balancing, see:
- Evolution Strategies for Direct Policy Search. Heidrich-Meisner, Igel. Parallel Problem Solving from Nature 2008

- In empirical results, uses Dynamic Movement Primitives
- A Generalized Path Integral Control Approach to Reinforcement Learning. Theodorou, Buchli, Schaal. JMLR 2010

- “Using parameterized policies avoids the curse of dimensionality associated with (discrete) state-action spaces, and using probability-weighted averaging avoids having to estimate a gradient, which can be difficult for noisy and discontinuous cost functions”
- Seems like PI^2 has high computational costs relative to something like Cross-Entropy
- Also discusses the PoWER algorithm, but says performance is basically the same as PI^2
- “PI^2’s properties follow directly from first principles of stochastic optimal control.” On the other hand, CMAES and CEM have less formal backing
- The evaluation task is a time-dependent 10-DOF arm, previously used in the Theodorou paper
- Unlike in CEM/CMAES, the covariance matrix in PI^2 does not vary; only the mean does. This is done in order to have the derivation of PI^2 go through.
- But in this paper they ignore that and use adaptive covariance as well
- Which is what they call **PI^2-CMA**, Path Integral Policy Improvement with Covariance Matrix Adaptation.
- Forming PI^2-CMAES is analogous but slightly more complex
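
A simplified sketch of the adaptive-covariance idea: reuse the same probability weights that update the mean to also re-estimate the covariance matrix. Assumptions to flag: this collapses the update to a single weighting per rollout (the per-time-step weighting and temporal averaging of PI^2 are omitted), deviations are measured from the mean the samples were drawn around, and `pi2_cma_update` and `h=10.0` are hypothetical names and values.

```python
import math

def pi2_cma_update(mean, samples, costs, h=10.0):
    """Return (new_mean, new_cov) from probability-weighted averaging."""
    # Exponentiate min-max normalized costs: low cost -> high probability.
    lo, hi = min(costs), max(costs)
    span = (hi - lo) or 1.0
    raw = [math.exp(-h * (c - lo) / span) for c in costs]
    total = sum(raw)
    probs = [r / total for r in raw]

    dim = len(mean)
    # Mean update: the part plain PI^2 already does.
    new_mean = [sum(p * s[d] for p, s in zip(probs, samples)) for d in range(dim)]
    # Covariance update: weighted outer products of the sample deviations.
    devs = [[s[d] - mean[d] for d in range(dim)] for s in samples]
    new_cov = [[sum(p * dv[i] * dv[j] for p, dv in zip(probs, devs))
                for j in range(dim)] for i in range(dim)]
    return new_mean, new_cov
```

With the covariance re-estimated each iteration, the exploration magnitude no longer has to be hand-tuned and held fixed.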

- PI^2 has much slower convergence than the variant proposed in this paper
- They have results for how initial parameterizations impact perf, doesn’t seem to matter too much